feat(ai): Study Buddy golden set + eval — Phase 6 complete (AI-039) by mrviduus · Pull Request #331 · mrviduus/textstack

mrviduus · 2026-06-15T14:54:28Z

Phase 6, AI-039: the Definition of Done (DoD) gate for the Study Buddy agent (judge ≥4/5, avg steps ≤4, cost <$0.05). This closes Phase 6: agent loop → tools → persistence → SSE endpoint → reader UI → eval.
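The three thresholds above can be read as a single pass/fail check. The snippet below is a minimal Python sketch of that reading, not the project's .NET code: `EvalSummary` and `passes_dod` are hypothetical names, and it assumes the step and cost limits apply to per-run averages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    """Aggregates of one eval run (hypothetical shape, for illustration only)."""
    judge_mean: float    # mean judge score on the 1-5 scale
    avg_steps: float     # mean agent iterations per golden case
    avg_cost_usd: float  # mean cost per golden case, in USD


def passes_dod(summary: EvalSummary) -> bool:
    # Thresholds as stated in the PR: judge >= 4/5, avg steps <= 4, cost < $0.05.
    # Assumption: steps and cost are compared as averages over the golden set.
    return (
        summary.judge_mean >= 4.0
        and summary.avg_steps <= 4.0
        and summary.avg_cost_usd < 0.05
    )
```

Note that the cost bound is strict while the other two are inclusive, following the wording of the gate.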

Changes

StudyBuddyEvalRunner (Ai.EvalSuite) — per golden: run the real agent against a real edition (its tools hit the live corpus), score the final answer against the reference with the same MEAI RubricEvaluator the rest of the suite uses (axes: correctness / grounding / clarity), and record iterations + cost. A budget-exhausted run is a failed case (judge 0, capped steps). Persists a studybuddy EvalRun (judge mean 1–5; avgSteps + avgCostUsd in the breakdown).
Golden set studybuddy.json (embedded): 10 starter DDIA "confusing passage → reference explanation" cases. This is a starting point, to be curated up to the DoD's 25 cases against the live edition, with chapter numbers aligned to that edition.
POST /admin/ai-quality/evals/studybuddy/run?editionId=&judge= — admin-triggered against a real embedded edition (503 when the judge isn't configured). The agent's tools resolve scoped services from the request scope.
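The per-golden loop described above can be sketched roughly as follows. This is an illustrative Python outline, not the actual StudyBuddyEvalRunner: `Golden`, `AgentResult`, `run_eval`, and `MAX_STEPS` are hypothetical, the agent and judge are passed in as plain callables, and a budget-exhausted run is modelled as an `answer` of `None`. Only the key names `avgSteps` and `avgCostUsd` come from the PR text.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional, Sequence

MAX_STEPS = 4  # hypothetical step cap; the real budget is not stated in this PR


@dataclass(frozen=True)
class Golden:
    passage: str    # the confusing passage shown to the agent
    reference: str  # the reference explanation the judge compares against


@dataclass(frozen=True)
class AgentResult:
    answer: Optional[str]  # None models a budget-exhausted run (assumption)
    steps: int
    cost_usd: float


def run_eval(
    goldens: Sequence[Golden],
    agent: Callable[[str], AgentResult],
    judge: Callable[[str, str], float],
) -> dict:
    if not goldens:
        raise ValueError("golden set is empty")
    scores, steps, costs = [], [], []
    for golden in goldens:
        result = agent(golden.passage)
        if result.answer is None:
            # Budget exhausted: counted as a failed case (judge 0, capped steps).
            scores.append(0.0)
            steps.append(MAX_STEPS)
        else:
            scores.append(judge(result.answer, golden.reference))
            steps.append(result.steps)
        costs.append(result.cost_usd)
    return {
        "judgeMean": mean(scores),
        "breakdown": {"avgSteps": mean(steps), "avgCostUsd": mean(costs)},
    }
```

Counting exhausted runs as zero-score cases, rather than dropping them, keeps a flaky agent from looking better than it is: every golden contributes to the mean.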

Verification

StudyBuddyEvalRunner with a direct-answering fake agent (no tools, one iteration) + fixed judge: scores the whole golden set, aggregates the judge mean, computes avg steps (1.0) + cost deterministically. Golden count read from the dataset (survives growth to 25).
Full TextStack.UnitTests (272) + TextStack.AiEvals (29) green; solution builds; dotnet format clean.
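The deterministic check described above (fake agent, fixed judge, golden count read from the dataset) might look like the following in outline. It is a Python sketch under stated assumptions, not the project's unit test: the dataset schema (`id`, `passage`, `reference`), the fixed score of 4.0, and the fixed cost of 0.01 are all invented for illustration.

```python
import json
from statistics import mean

# Hypothetical stand-in for the embedded studybuddy.json; the real schema
# is not shown in this PR.
DATASET = json.dumps([
    {"id": f"case-{i}", "passage": f"passage {i}", "reference": f"reference {i}"}
    for i in range(1, 11)
])


def fake_agent(passage: str) -> dict:
    # Answers directly: no tool calls, exactly one iteration, fixed cost.
    return {"answer": f"explanation of {passage}", "steps": 1, "cost_usd": 0.01}


def fixed_judge(answer: str, reference: str) -> float:
    # Always returns the same score, so the aggregate is predictable.
    return 4.0


def evaluate(dataset_json: str) -> dict:
    goldens = json.loads(dataset_json)
    results = [fake_agent(g["passage"]) for g in goldens]
    scores = [fixed_judge(r["answer"], g["reference"]) for r, g in zip(results, goldens)]
    return {
        "cases": len(goldens),
        "judge_mean": mean(scores),
        "avg_steps": mean(r["steps"] for r in results),
        "avg_cost_usd": mean(r["cost_usd"] for r in results),
    }


summary = evaluate(DATASET)
# The expected count comes from the dataset itself, so growing the golden
# set to 25 cases does not require touching the assertion.
assert summary["cases"] == len(json.loads(DATASET))
assert summary["avg_steps"] == 1.0
```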

After merge (closes the phase)

Run the admin eval on prod against the DDIA edition → confirm judge ≥4/5, avg steps ≤4, cost <$0.05; curate the golden set to 25. The agent is already live end-to-end in the reader (#330).
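For the post-merge run, the request could be assembled along these lines. This Python sketch only builds the request and does not send it. The path and the `editionId` and `judge` query parameters come from this PR; the base URL, the parameter values, and the helper names are placeholders, and admin authentication is omitted because it is not described here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_eval_request(base_url: str, edition_id: str, judge: str) -> Request:
    # Path and query parameter names are taken from the PR description.
    query = urlencode({"editionId": edition_id, "judge": judge})
    url = f"{base_url.rstrip('/')}/admin/ai-quality/evals/studybuddy/run?{query}"
    return Request(url, method="POST")


def judge_not_configured(status: int) -> bool:
    # The PR states the endpoint answers 503 when the judge isn't configured.
    return status == 503


# Placeholder host and values; substitute the real admin host, edition id,
# and judge name before sending.
request = build_eval_request("https://example.invalid", "EDITION_ID", "JUDGE_NAME")
```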

🤖 Generated with Claude Code

StudyBuddyEvalRunner runs the real agent over golden passages against a real edition, judges each answer (RubricEvaluator: correctness/grounding/clarity) and records steps+cost; budget-exhausted = failed case. Persists studybuddy EvalRun (judge mean + avgSteps/avgCost breakdown). Embedded studybuddy.json (10 starter DDIA passages, curate to 25). POST /admin/ai-quality/evals/studybuddy/run?editionId=&judge=. Closes Phase 6.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mrviduus merged commit 9903074 into main Jun 15, 2026
5 checks passed

mrviduus deleted the ai-039-studybuddy-eval branch June 15, 2026 15:06

mrviduus merged 1 commit into main from ai-039-studybuddy-eval
