feat(ai): Study Buddy golden set + eval — Phase 6 complete (AI-039) #331
Merged
StudyBuddyEvalRunner runs the real agent over golden passages against a real edition, judges each answer (RubricEvaluator: correctness/grounding/clarity), and records steps and cost; a budget-exhausted run counts as a failed case. Persists a studybuddy EvalRun (judge mean, with avgSteps/avgCost in the breakdown). Embedded studybuddy.json (10 starter DDIA passages, to be curated to 25). POST /admin/ai-quality/evals/studybuddy/run?editionId=&judge=. Closes Phase 6.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 6 AI-039 — the DoD gate for the Study Buddy agent (judge ≥4/5, avg steps ≤4, cost <$0.05). This closes Phase 6: agent loop → tools → persistence → SSE endpoint → reader UI → eval.
Changes
StudyBuddyEvalRunner (Ai.EvalSuite): for each golden, runs the real agent against a real edition (its tools hit the live corpus), scores the final answer against the reference with the same MEAI RubricEvaluator the rest of the suite uses (axes: correctness / grounding / clarity), and records iterations + cost. A budget-exhausted run is a failed case (judge 0, capped steps). Persists a studybuddy EvalRun (judge mean 1–5; avgSteps + avgCostUsd in the breakdown).
studybuddy.json (embedded): 10 starter DDIA "confusing passage → reference explanation" cases; a starter set to curate to the DoD's 25 against the live edition (align chapter numbers to it).
POST /admin/ai-quality/evals/studybuddy/run?editionId=&judge= is admin-triggered against a real embedded edition (503 when the judge isn't configured). The agent's tools resolve scoped services from the request scope.
Verification
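For readers who want the scoring loop and the fake-agent check in concrete form, here is a minimal sketch. It is written in Python for brevity and is not the project's code: `Golden`, `AgentResult`, `EvalSummary`, `run_eval`, and `MAX_STEPS = 4` are hypothetical names and values chosen for illustration. It also assumes that a budget-exhausted run still contributes its spend to the average cost, which the description does not state.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_STEPS = 4  # assumed step cap; the PR only says steps are "capped"


@dataclass
class Golden:
    passage: str    # the confusing passage shown to the agent
    reference: str  # the reference explanation the judge compares against


@dataclass
class AgentResult:
    answer: Optional[str]  # None models a budget-exhausted run (assumption)
    steps: int
    cost_usd: float


@dataclass
class EvalSummary:
    judge_mean: float
    avg_steps: float
    avg_cost_usd: float
    cases: int


def run_eval(
    goldens: List[Golden],
    agent: Callable[[str], AgentResult],
    judge: Callable[[str, str], float],
) -> EvalSummary:
    """Run the agent on every golden, judge each answer, and aggregate."""
    if not goldens:
        raise ValueError("golden set is empty")
    scores: List[float] = []
    steps: List[float] = []
    costs: List[float] = []
    for golden in goldens:
        result = agent(golden.passage)
        if result.answer is None:
            # Budget exhausted: failed case, judge 0 and capped steps.
            scores.append(0.0)
            steps.append(float(MAX_STEPS))
        else:
            scores.append(judge(result.answer, golden.reference))
            steps.append(float(result.steps))
        costs.append(result.cost_usd)  # assumed: spend counts either way
    n = len(goldens)  # count comes from the dataset, so growth to 25 is fine
    return EvalSummary(sum(scores) / n, sum(steps) / n, sum(costs) / n, n)


# Direct-answering fake agent (no tools, one iteration) plus a fixed judge.
goldens = [Golden("passage 1", "reference 1"), Golden("passage 2", "reference 2")]
fake_agent = lambda passage: AgentResult("direct answer", steps=1, cost_usd=0.01)
fixed_judge = lambda answer, reference: 4.0
summary = run_eval(goldens, fake_agent, fixed_judge)
```

With the fake agent and fixed judge, the averages are fully deterministic, which is what makes this shape of test useful as a regression check on the aggregation itself.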
StudyBuddyEvalRunner with a direct-answering fake agent (no tools, one iteration) + fixed judge: scores the whole golden set, aggregates the judge mean, computes avg steps (1.0) + cost deterministically. Golden count is read from the dataset (survives growth to 25).
TextStack.UnitTests (272) + TextStack.AiEvals (29) green; solution builds; dotnet format clean.
After merge (closes the phase)
Run the admin eval on prod against the DDIA edition → confirm judge ≥4/5, avg steps ≤4, cost <$0.05; curate the golden set to 25. The agent is already live end-to-end in the reader (#330).
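The three thresholds above can be checked mechanically once the eval run is persisted. The following is a small sketch of such a check, again in Python and purely illustrative: `dod_failures` and the constant names are hypothetical, and it assumes the cost threshold applies to the average cost per case, which the description leaves implicit.

```python
from typing import List

# Thresholds as stated in the PR: judge >= 4/5, avg steps <= 4, cost < $0.05.
JUDGE_MIN = 4.0
MAX_AVG_STEPS = 4.0
MAX_AVG_COST_USD = 0.05  # assumed to be the average cost per golden case


def dod_failures(judge_mean: float, avg_steps: float, avg_cost_usd: float) -> List[str]:
    """Return a list of unmet DoD criteria; an empty list means the gate passes."""
    failures: List[str] = []
    if judge_mean < JUDGE_MIN:
        failures.append(f"judge mean {judge_mean:.2f} is below {JUDGE_MIN}")
    if avg_steps > MAX_AVG_STEPS:
        failures.append(f"avg steps {avg_steps:.2f} exceeds {MAX_AVG_STEPS}")
    if avg_cost_usd >= MAX_AVG_COST_USD:
        failures.append(f"avg cost ${avg_cost_usd:.3f} is not under ${MAX_AVG_COST_USD}")
    return failures


passing = dod_failures(judge_mean=4.2, avg_steps=2.5, avg_cost_usd=0.03)
failing = dod_failures(judge_mean=3.9, avg_steps=4.0, avg_cost_usd=0.05)
```

Returning the list of unmet criteria, rather than a single boolean, makes it easier to see which part of the gate needs attention while the golden set is still being curated.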
🤖 Generated with Claude Code