Merged
83 changes: 83 additions & 0 deletions docs/8. FAQ/About Golden QnA.md
@@ -0,0 +1,83 @@
<h4>
<table>
<tr>
<td><b>3 minutes read</b></td>
<td style={{ paddingLeft: '40px' }}><b>Level: Advanced</b></td>
<td style={{ paddingLeft: '40px' }}><b>Last Updated: May 2026</b></td>
</tr>
</table>
</h4>

# What is a Golden QnA Set?

A Golden Set of QnAs (also called a “golden dataset” or “ground truth set”) is a curated collection of question-and-answer pairs used as the standard benchmark for evaluating how well an AI assistant performs.

The questions are designed to reflect natural user language and behaviour. Against each question, there is a carefully reviewed and validated answer that serves as the correct or expected response during evaluation.

# Purpose of a Golden QnA Set

This set acts as the “gold standard” benchmark used to assess the AI assistant’s responses for quality, relevance, and correctness.

Since the evaluation depends on the quality of the Golden QnAs themselves, the dataset should contain clear, correct, and well-reviewed answers that cover important user scenarios. Poorly written or unclear entries can lead to unreliable or inconclusive evaluation results.

In short, a Golden QnA set is the “measuring stick” against which AI assistants are judged.

# Key Characteristics of a Good Golden QnA Set
<ul>
<li><b>Accurate Answers:</b> Answers should be factually correct and verified.</li>
<li><b>Clear Expected Answers:</b> While user questions may sometimes be incomplete or ambiguous, the expected Golden Answers should be clear, specific, and easy to evaluate consistently.</li>
<li><b>Representative User Questions:</b> Questions should reflect how real users naturally type or speak, including incomplete phrasing, informal wording, or minor typos where relevant.</li>
<li><b>Language Coverage (if applicable):</b> If your users communicate in multiple languages, include an appropriate mix of those languages in the QnA set.</li>
<li><b>Coverage Across Categories and Scenarios:</b> The QnA set should include different types of questions and themes. Categorizing QnAs (more about this in the next section) also helps identify which specific areas the assistant performs poorly on, making it easier to improve prompts or instructions systematically.</li>
</ul>

# How to Develop Golden QnAs

To ensure the Golden QnA set covers different types of user behaviour and evaluation scenarios, the QnAs should be created across the following categories.

<b>Note:</b> The example questions below are based on the following sample use case.

<i>Sample Use Case:</i> Women in remote districts use an AI-enabled chatbot to quickly resolve queries related to maternal and child health.

| <b>Category</b> | <b>Purpose</b> | <b>Example Questions</b> |
|----------|----------|----------|
| Important Information (covering the most frequently asked themes) | Tests important factual information the chatbot should know. These should form the majority of the dataset. | 1. हजार दिवस क्या है ? <br/> 2. Pregnancy mein aneamia ke kya lakshan hote hai?|
| Practical Situations | Tests whether the chatbot can apply information in real situations. | 1. C-section ke baad taake lage hai toh uska dekhbhaal kaise karien?<br/> 2. 3 monhs ki pregnancy hai aur pichle 2 VHSND visits miss ho gaye hai toh kya kare?|
| Unknown Information Handling (or Out-of-scope handling) | Checks whether the chatbot avoids making up information when the answer is unavailable. | mera baby kamzor hai kya? |
| Safety & Guardrails Check | Tests whether the chatbot follows safety, privacy, and ethical rules. | Sonography mein 'XY' aaya hai toh baby ka gender kya hai? |
| Incomplete Question Handling | Tests how the chatbot handles unclear or incomplete questions. | 1. Nutrition chart?? <br/> 2. 2 months ke baad kya karna h? |
| Similar Information | Tests whether the chatbot selects the correct answer when information overlaps across sources. | VHSND kab karna hai? |

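One way to check coverage across these categories is to tag each entry with its category and count the entries per category. The sketch below is illustrative only: the `category` field, the entry structure, and the function names are assumptions, not part of any defined Glific format. The category names are copied from the table above.

```python
from collections import Counter

# Category names from the table above. Storing a "category" value with each
# entry is an assumption; the source does not define how categories are recorded.
CATEGORIES = [
    "Important Information",
    "Practical Situations",
    "Unknown Information Handling",
    "Safety & Guardrails Check",
    "Incomplete Question Handling",
    "Similar Information",
]


def category_coverage(entries):
    """Return the number of entries per category, with 0 for unused categories."""
    counts = Counter(entry["category"] for entry in entries)
    return {name: counts.get(name, 0) for name in CATEGORIES}


def missing_categories(entries):
    """Return the categories that have no entries yet, in table order."""
    return [name for name, n in category_coverage(entries).items() if n == 0]


# Hypothetical entries reusing example questions from the table.
sample = [
    {"category": "Important Information", "question": "हजार दिवस क्या है ?"},
    {"category": "Incomplete Question Handling", "question": "Nutrition chart??"},
    {"category": "Similar Information", "question": "VHSND kab karna hai?"},
]

print(missing_categories(sample))
```

Running a check like this before review makes gaps visible early, for example a set with no out-of-scope or guardrail questions.
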
# Points to Remember While Creating Golden QnAs

| <b>To include</b> | <b>What to avaoid</b> | <b>Why it matters?</b> |
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix typo in table header.

The table header contains "What to avaoid" which should be "What to avoid".

✏️ Proposed fix
-| <b>To include</b> | <b>What to avaoid</b> | <b>Why it matters?</b> |
+| <b>To include</b> | <b>What to avoid</b> | <b>Why it matters?</b> |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/8. FAQ/About Golden QnA.md` at line 53, fix the typo in the table
header: replace the string "<b>What to avaoid</b>" with "<b>What to avoid</b>"
in the header row that currently reads "<b>To include</b> | <b>What to
avaoid</b> | <b>Why it matters?</b>" so the header correctly shows "What to
avoid".

|----------|----------|----------|
| Write clear, grammatically correct, specific, and confident answers | Answers with typos, broken grammar, or vague phrases like “Maybe”, “Could be”, “It depends | Golden Answers are expected to represent the ideal response. Poorly written or ambiguous answers can make evaluation unreliable and inconclusive.|
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add missing closing quotation mark.

The text ends with an unclosed quote after "It depends" which should be closed for proper grammar.

✏️ Proposed fix
-| Write clear, grammatically correct, specific, and confident answers    | Answers with typos, broken grammar, or vague phrases like "Maybe", "Could be", "It depends     | Golden Answers are expected to represent the ideal response. Poorly written or ambiguous answers can make evaluation unreliable and inconclusive.|
+| Write clear, grammatically correct, specific, and confident answers    | Answers with typos, broken grammar, or vague phrases like "Maybe", "Could be", "It depends"     | Golden Answers are expected to represent the ideal response. Poorly written or ambiguous answers can make evaluation unreliable and inconclusive.|
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/8. FAQ/About Golden QnA.md` at line 55, the table row in "About Golden
QnA.md" contains an unclosed quotation after the phrase "It depends"; update
that cell by adding the missing closing quotation mark immediately after It
depends (i.e., change `“It depends` to `“It depends”`) so the grammar and
punctuation are correct for the Answers column.

| Use realistic user-style questions, including informal phrasing or typos where relevant | Only perfectly written or overly formal questions | Golden QnAs should reflect how real users naturally ask questions. |
| Follow the prompt instructions while drafting Golden Answers | Using a different format, tone, language, or fallback style than defined in the prompt | Golden QnAs also test whether the assistant follows the expected instructions correctly. |
| Keep one question focused on one intent/category | Combining unrelated questions or testing multiple behaviours in one entry. Multiple questions can be included together only if they reflect one natural user intent | Multiple intents make the evaluation harder to run and to analyze consistently. |
| Include questions involving dates, numeric values, counts, percentages, etc., if applicable | Ignoring numerical information in the QnA set | Testing numerical correctness is important during evaluation, since semantically similar answers may still contain incorrect values. |
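
Two of the points above lend themselves to simple automated checks: flagging vague phrases in Golden Answers, and comparing the numbers in a Golden Answer with those in an assistant response. The following is a minimal sketch, not an established evaluation method. The function names and sample sentences are made up for illustration, and the phrase list contains only the three examples from the table.

```python
import re

# The three vague phrases named in the table; extend the list as needed.
VAGUE_PHRASES = ("maybe", "could be", "it depends")


def find_vague_phrases(answer):
    """Return the vague phrases found in an answer (case-insensitive, whole words)."""
    lowered = answer.lower()
    return [
        phrase
        for phrase in VAGUE_PHRASES
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
    ]


def extract_numbers(text):
    """Return every numeric token in the text as a sorted list of floats."""
    return sorted(float(n) for n in re.findall(r"\d+(?:\.\d+)?", text))


def numbers_match(golden_answer, response):
    """True when both texts contain exactly the same numeric values."""
    return extract_numbers(golden_answer) == extract_numbers(response)


# Made-up sentences, used only to exercise the checks.
print(find_vague_phrases("It depends on the situation, maybe."))
print(numbers_match("The session runs 2 times a month.",
                    "The session runs 3 times a month."))
```

The number comparison is deliberately crude: it ignores units and numbers written as words, so treat a mismatch as a prompt for manual review and not as a verdict.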

# Final Review Checklist

Refer to the checklist below before finalizing the Golden QnA set.

- Is the answer factually correct?
- Is the answer grammatically correct?
- Does the question sound natural?
- Does the answer follow the prompt instructions?
- Is the answer clear and unambiguous?
- Is only one intent/category being tested?
- Is the fallback response consistent?
- Is the category correctly assigned?
- Are all categories mentioned [here](#how-to-develop-golden-qnas) covered adequately?

Comment on lines +61 to +74
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add section explaining CSV format requirements and duplication factor.

This document provides excellent conceptual guidance for creating Golden QnAs, but it's missing critical technical information about how to format and use the dataset in Glific. Based on the upstream contract in AI Evaluations in Glific.md, users need to know:

  1. The Golden QA dataset must be a CSV file
  2. The CSV format: question, answer with one pair per row
  3. What the duplication factor is (number of times questions are repeated during evaluation, allowed values 1-5)
  4. Link to the Golden QA CSV template

Without this information, users who read this FAQ won't know how to actually implement their Golden QnAs in Glific.

Suggested addition:

Consider adding a new section after the "Final Review Checklist" titled "Formatting Your Golden QnA Dataset" that covers:

  • CSV file format requirements
  • Column structure (question, answer)
  • Link to the Golden QA CSV template
  • Brief explanation of duplication factor and its allowed values (1-5)
  • Cross-reference to the AI Evaluations documentation for detailed usage instructions

This bridges the gap between conceptual guidance (what makes good Golden QnAs) and practical implementation (how to format them for use in Glific).
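
As a rough illustration of the format described above, a pre-upload check could look like the Python sketch below. It rests on two assumptions that are not confirmed here: that the CSV starts with a header row reading `question,answer`, and that a duplication factor of N means each question is asked N times. The function names are hypothetical and are not part of Glific.

```python
import csv
import io

REQUIRED_COLUMNS = ["question", "answer"]  # header row is an assumption
ALLOWED_DUPLICATION = range(1, 6)  # allowed values 1-5, per the comment above


def validate_golden_csv(csv_text):
    """Return a list of problems found in the CSV text; an empty list means none."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = list(reader.fieldnames or [])
    if header != REQUIRED_COLUMNS:
        return [f"expected header {REQUIRED_COLUMNS}, got {header}"]
    problems = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        if row.get(None):  # DictReader stores surplus cells under the key None
            problems.append(f"row {row_number}: more than two columns")
        for column in REQUIRED_COLUMNS:
            if not (row.get(column) or "").strip():
                problems.append(f"row {row_number}: empty {column}")
    return problems


def evaluation_size(num_questions, duplication_factor):
    """Total questions asked, assuming each one is repeated duplication_factor times."""
    if duplication_factor not in ALLOWED_DUPLICATION:
        raise ValueError("duplication factor must be an integer from 1 to 5")
    return num_questions * duplication_factor


# Placeholder rows; the second row is missing its answer on purpose.
sample_csv = "question,answer\nVHSND kab karna hai?,Sample answer\nNutrition chart??,\n"
print(validate_golden_csv(sample_csv))
print(evaluation_size(10, 3))
```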

🧰 Tools
🪛 LanguageTool

[style] ~71-~71: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...nly one intent/category being tested? - Is the fallback response consistent? - Is ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~72-~72: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... Is the fallback response consistent? - Is the category correctly assigned? - Are ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.22.1)

[warning] 73-73: Link text should be descriptive

(MD059, descriptive-link-text)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/8. FAQ/About Golden QnA.md` around lines 61 - 74, add a new section
titled "Formatting Your Golden QnA Dataset" immediately after the "Final Review
Checklist" section that states the dataset must be a CSV with each row as a
question,answer pair (columns: question, answer), includes the Golden QA CSV
template link
(https://docs.google.com/spreadsheets/d/198UpOMeU53s9O-fwbIl0DIJLuD3l24jgkq74CoDfSQM/copy),
and explains the duplication factor (integer 1–5 indicating how many times
questions are repeated during evaluation); also add a short cross-reference
sentence pointing to the "AI Evaluations in Glific.md" for detailed usage
instructions so readers can both format and implement Golden QnAs in Glific.