feat(ai): Study Buddy golden set + eval — Phase 6 complete (AI-039)#331

Merged
mrviduus merged 1 commit into main from ai-039-studybuddy-eval
Jun 15, 2026

@mrviduus

Owner

Phase 6, AI-039: the Definition of Done (DoD) gate for the Study Buddy agent (judge ≥4/5, avg steps ≤4, cost <$0.05). This closes Phase 6: agent loop → tools → persistence → SSE endpoint → reader UI → eval.

Changes

  • StudyBuddyEvalRunner (Ai.EvalSuite) — per golden: run the real agent against a real edition (its tools hit the live corpus), score the final answer against the reference with the same MEAI RubricEvaluator the rest of the suite uses (axes: correctness / grounding / clarity), and record iterations + cost. A budget-exhausted run is a failed case (judge 0, capped steps). Persists a studybuddy EvalRun (judge mean 1–5; avgSteps + avgCostUsd in the breakdown).
  • Golden set studybuddy.json (embedded): 10 starter DDIA "confusing passage → reference explanation" cases. This is a starting point, to be curated up to the DoD's 25 cases against the live edition (align the chapter numbers to that edition).
  • POST /admin/ai-quality/evals/studybuddy/run?editionId=&judge= — admin-triggered against a real embedded edition (503 when the judge isn't configured). The agent's tools resolve scoped services from the request scope.
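The scoring and aggregation described for StudyBuddyEvalRunner can be sketched roughly as follows. This is an illustrative Python sketch, not the project's C# code: `CaseResult`, `normalize`, `summarize`, and the `MAX_STEPS` value are hypothetical names and assumptions. Only the rule itself (a budget-exhausted run counts as judge 0 with capped steps, and the run reports a judge mean plus avgSteps and avgCostUsd) comes from the description above.

```python
from dataclasses import dataclass
from statistics import mean

# Assumed step cap; the PR only says a budget-exhausted run gets "capped steps".
MAX_STEPS = 4


@dataclass
class CaseResult:
    judge_score: float            # rubric judge score, 1-5 for a completed run
    steps: int                    # agent iterations used
    cost_usd: float               # cost recorded for the run
    budget_exhausted: bool = False


def normalize(result: CaseResult) -> CaseResult:
    """Treat a budget-exhausted run as a failed case: judge 0, steps capped."""
    if result.budget_exhausted:
        return CaseResult(0.0, MAX_STEPS, result.cost_usd, True)
    return result


def summarize(results: list[CaseResult]) -> dict[str, float]:
    """Aggregate per-case results into the judge mean and the breakdown."""
    cases = [normalize(r) for r in results]
    return {
        "judgeMean": mean(c.judge_score for c in cases),
        "avgSteps": mean(c.steps for c in cases),
        "avgCostUsd": mean(c.cost_usd for c in cases),
    }
```

Keeping failed runs in the averages, rather than dropping them, means a run that exhausts its budget pulls the judge mean down instead of silently shrinking the sample.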

Verification

  • StudyBuddyEvalRunner with a direct-answering fake agent (no tools, one iteration) + fixed judge: scores the whole golden set, aggregates the judge mean, computes avg steps (1.0) + cost deterministically. Golden count read from the dataset (survives growth to 25).
  • Full TextStack.UnitTests (272) + TextStack.AiEvals (29) green; solution builds; dotnet format clean.
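The deterministic check described above (a direct-answering fake agent plus a fixed judge) might look something like this in outline. It is a Python sketch under stated assumptions, not the actual unit test: `Golden`, `AgentRun`, `fake_agent`, `fixed_judge`, and `run_eval` are hypothetical names, and the fixed score and cost values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Golden:
    passage: str      # the confusing passage
    reference: str    # the reference explanation


@dataclass
class AgentRun:
    answer: str
    iterations: int
    cost_usd: float


def fake_agent(passage: str) -> AgentRun:
    # Answers directly: no tools, exactly one iteration, fixed (made-up) cost.
    return AgentRun(answer=f"Explanation of: {passage}", iterations=1, cost_usd=0.01)


def fixed_judge(answer: str, reference: str) -> float:
    # Always returns the same score, so the aggregate is predictable.
    return 4.0


def run_eval(
    goldens: list[Golden],
    agent: Callable[[str], AgentRun],
    judge: Callable[[str, str], float],
) -> dict[str, float]:
    runs = [agent(g.passage) for g in goldens]
    scores = [judge(r.answer, g.reference) for r, g in zip(runs, goldens)]
    n = len(goldens)  # count comes from the dataset, not a hard-coded 10
    return {
        "cases": n,
        "judgeMean": sum(scores) / n,
        "avgSteps": sum(r.iterations for r in runs) / n,
        "avgCostUsd": sum(r.cost_usd for r in runs) / n,
    }
```

Because the case count is read from the golden list, the same assertions keep holding when the set grows from 10 to 25.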

After merge (closes the phase)

Run the admin eval on prod against the DDIA edition → confirm judge ≥4/5, avg steps ≤4, cost <$0.05; curate the golden set to 25. The agent is already live end-to-end in the reader (#330).
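As a rough illustration of the confirmation step, the three DoD thresholds can be checked against a run's summary like this. The function and constant names are hypothetical, and reading "cost <$0.05" as the average cost per run is an assumption; the threshold values themselves are the ones quoted in this PR.

```python
# Thresholds quoted in the PR description; the names are illustrative.
MIN_JUDGE_MEAN = 4.0      # judge >= 4/5
MAX_AVG_STEPS = 4.0       # avg steps <= 4
MAX_AVG_COST_USD = 0.05   # cost < $0.05 (assumed to be the per-run average)


def dod_failures(judge_mean: float, avg_steps: float, avg_cost_usd: float) -> list[str]:
    """Return the criteria a run misses; an empty list means the gate passes."""
    failures = []
    if judge_mean < MIN_JUDGE_MEAN:
        failures.append(f"judge mean {judge_mean:.2f} is below {MIN_JUDGE_MEAN}")
    if avg_steps > MAX_AVG_STEPS:
        failures.append(f"avg steps {avg_steps:.2f} is above {MAX_AVG_STEPS}")
    if avg_cost_usd >= MAX_AVG_COST_USD:
        failures.append(f"avg cost ${avg_cost_usd:.4f} is not below ${MAX_AVG_COST_USD}")
    return failures
```

Returning the list of missed criteria, rather than a single boolean, makes it clear which of the three numbers needs attention after a prod run.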

🤖 Generated with Claude Code

StudyBuddyEvalRunner runs the real agent over golden passages against a real
edition, judges each answer (RubricEvaluator: correctness/grounding/clarity)
and records steps+cost; budget-exhausted = failed case. Persists studybuddy
EvalRun (judge mean + avgSteps/avgCost breakdown). Embedded studybuddy.json
(10 starter DDIA passages, curate to 25).
POST /admin/ai-quality/evals/studybuddy/run?editionId=&judge=. Closes Phase 6.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mrviduus merged commit 9903074 into main on Jun 15, 2026
5 checks passed
@mrviduus deleted the ai-039-studybuddy-eval branch on June 15, 2026, 15:06