mrviduus · Jun 15, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,15 @@
 
 ## [Unreleased]
 
+### Phase 6 — Study Buddy golden set + eval — Phase 6 complete (2026-06-15)
+
+Phase 6 **AI-039** — the DoD gate for the Study Buddy agent (judge ≥4/5, avg steps ≤4, cost <$0.05). **This closes Phase 6** (agent loop → tools → persistence → SSE endpoint → reader UI → eval).
+
+- **`StudyBuddyEvalRunner`** (`Ai.EvalSuite`): for each golden, runs the REAL agent against a real edition (its tools hit the live corpus), scores the final answer against the reference with the same MEAI `RubricEvaluator` the rest of the suite uses (correctness / grounding / clarity), and records iterations + cost. A budget-exhausted run counts as a failed case (judge 0, with the steps and cost it used before hitting the cap). Persists a `studybuddy` `EvalRun` (judge mean on the 1–5 scale, where failed cases contribute 0; avg steps + cost in the breakdown).
+- **Golden set** `studybuddy.json` (embedded): 10 starter DDIA "confusing passage → reference explanation" cases, a **starting point to be curated up to the DoD's 25** against the live edition (align chapter numbers to it).
+- **`POST /admin/ai-quality/evals/studybuddy/run?editionId=&judge=`** — admin-triggered against a real embedded edition (503 when the judge isn't configured). The agent's tools resolve scoped services from the request scope.
+- Tests: `StudyBuddyEvalRunner` with a direct-answering fake agent + fixed judge — scores the whole golden set, aggregates the judge mean, and computes avg steps (1.0) + cost deterministically. Golden count read from the dataset so it survives growth to 25.
+
 ### Phase 6 — Study Buddy wired into the reader (2026-06-15)
 
 Phase 6 **AI-038, slice b** — the panel is now reachable: select a passage in the reader → "Help me understand this" → the agent investigates live.

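The DoD numbers named in the changelog entry above (judge ≥4/5, avg steps ≤4, cost <$0.05) can be illustrated with a small aggregation sketch. This is not the runner's code: `CaseOutcome` and `DodGate` are hypothetical names, and the judge mean here is a plain average of per-case scores, which may differ from what `JudgeRunner.Aggregate` does internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one eval case (the real StudyBuddyCase also carries the passage).
public sealed record CaseOutcome(double JudgeScore, int Steps, decimal CostUsd, bool Completed);

public static class DodGate
{
    // Thresholds quoted from the changelog: judge >= 4/5, avg steps <= 4, avg cost < $0.05.
    public static (double JudgeMean, double AvgSteps, decimal AvgCostUsd, bool Passed) Evaluate(
        IReadOnlyList<CaseOutcome> cases)
    {
        if (cases.Count == 0)
            return (0, 0, 0m, false);

        // Assumption: a budget-exhausted case is already recorded with JudgeScore 0,
        // so it pulls the mean down instead of being skipped.
        var judgeMean = cases.Average(c => c.JudgeScore);
        var avgSteps = cases.Average(c => (double)c.Steps);
        var avgCost = cases.Average(c => c.CostUsd);

        var passed = judgeMean >= 4.0 && avgSteps <= 4.0 && avgCost < 0.05m;
        return (judgeMean, avgSteps, avgCost, passed);
    }
}
```

For example, two completed cases scored 5.0 and 4.0 plus one budget-exhausted case scored 0 give a judge mean of 3.0, so the gate fails even though the completed answers were good. That is the effect of counting an exhausted budget as a failed case.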
diff --git a/backend/src/Ai/TextStack.Ai.EvalSuite/Datasets/studybuddy.json b/backend/src/Ai/TextStack.Ai.EvalSuite/Datasets/studybuddy.json
@@ -0,0 +1,52 @@
+[
+  {
+    "passage": "The CAP theorem is sometimes stated as \"Consistency, Availability, Partition tolerance: pick two out of three.\" Unfortunately, putting it this way is misleading.",
+    "chapterNumber": 9,
+    "rubricAnswer": "Partitions aren't something you choose — they happen. So the real choice is only between consistency (linearizability) and availability, and only WHEN a network partition is occurring. The popular '2 of 3' phrasing is misleading because partition tolerance isn't optional."
+  },
+  {
+    "passage": "In a leaderless replication system, the client sends each write to several replicas in parallel, and read requests are also sent to several nodes in parallel.",
+    "chapterNumber": 5,
+    "rubricAnswer": "Leaderless (Dynamo-style) replication has no single leader; the client (or a coordinator) writes to and reads from several replicas at once. Quorum overlap (w + r > n) is what gives a good chance of reading an up-to-date value despite no leader ordering the writes."
+  },
+  {
+    "passage": "An advantage of append-only log-structured storage is that segment files are immutable, so concurrent reads do not need to take locks on them.",
+    "chapterNumber": 3,
+    "rubricAnswer": "Because segments are only ever appended to and never modified in place, a reader can scan a segment while writes go elsewhere — no read/write locking needed. Immutability also simplifies crash recovery, since a partially written record is just ignored."
+  },
+  {
+    "passage": "Serializable snapshot isolation (SSI) is an optimistic concurrency control technique.",
+    "chapterNumber": 7,
+    "rubricAnswer": "Unlike pessimistic two-phase locking, which blocks to prevent conflicts, SSI lets transactions run on a snapshot and only checks at commit time whether they actually conflicted; if so, one is aborted and retried. It is optimistic because it assumes conflicts are rare."
+  },
+  {
+    "passage": "A node in the network cannot reliably tell whether another node has crashed or is merely slow to respond.",
+    "chapterNumber": 8,
+    "rubricAnswer": "On an asynchronous network there is no way to distinguish a dead node from a slow one — a missing reply could mean the node crashed, the request was lost, or it's just delayed. Systems use timeouts to guess, but a timeout is only a heuristic, not a certainty."
+  },
+  {
+    "passage": "Linearizability is a recency guarantee: it makes a system appear as though there is only a single copy of the data.",
+    "chapterNumber": 9,
+    "rubricAnswer": "Linearizability means that once a write completes, every later read sees that value (or a newer one) — the replicated system behaves as if there were one copy updated atomically. It's a strong consistency guarantee, stronger than eventual consistency."
+  },
+  {
+    "passage": "Skewed workloads can cause hot spots, where one partition receives disproportionately more load than the others.",
+    "chapterNumber": 6,
+    "rubricAnswer": "If keys aren't spread evenly (e.g. a celebrity user, or many records sharing a partition key), one partition gets far more traffic than the rest — a hot spot — which limits scalability because that node becomes the bottleneck. Hashing keys or adding a random prefix helps spread the load."
+  },
+  {
+    "passage": "Derived data, such as a search index or a cache, can always be recreated from the underlying source data.",
+    "chapterNumber": 11,
+    "rubricAnswer": "Source-of-truth (system of record) data is authoritative; derived data (indexes, caches, materialized views) is a transformation of it and is redundant — if lost or corrupted, it can be rebuilt by re-running the transformation over the source. That's why derived data can be treated as disposable."
+  },
+  {
+    "passage": "With event sourcing, application state is determined by a log of immutable events that are only ever appended.",
+    "chapterNumber": 11,
+    "rubricAnswer": "Instead of storing only the current state, event sourcing records every change as an immutable event in an append-only log; the current state is derived by replaying those events. This keeps a full history and lets you rebuild new views, at the cost of having to fold the log into state."
+  },
+  {
+    "passage": "The danger of two-phase locking is that transactions can deadlock, each waiting for a lock the other holds.",
+    "chapterNumber": 7,
+    "rubricAnswer": "Two-phase locking (2PL) makes transactions acquire locks and hold them until commit; if transaction A holds a lock B wants and B holds a lock A wants, neither can proceed — a deadlock. The database detects the cycle and aborts one transaction so the other can continue."
+  }
+]
diff --git a/backend/src/Ai/TextStack.Ai.EvalSuite/StudyBuddyEvalRunner.cs b/backend/src/Ai/TextStack.Ai.EvalSuite/StudyBuddyEvalRunner.cs
@@ -0,0 +1,134 @@
+using Application.Agents;
+using Application.Common.Interfaces;
+using Domain.Entities;
+using Microsoft.Extensions.AI;
+using Microsoft.Extensions.AI.Evaluation;
+using Microsoft.Extensions.Logging;
+using TextStack.Ai.Core;
+using TextStack.Ai.Evals;
+using TextStack.Ai.Llm;
+
+namespace TextStack.Ai.EvalSuite;
+
+/// <summary>One golden's outcome — for the admin UI.</summary>
+public sealed record StudyBuddyCase(string Passage, int Steps, decimal CostUsd, double JudgeScore, bool Completed);
+
+/// <summary>
+/// Result of a Study Buddy eval run (AI-039) — the Phase 6 DoD numbers: the judge mean (≥4/5 target),
+/// average iterations per run (≤4 target: a planner that's neither lazy nor over-eager) and average
+/// cost (&lt;$0.05 target).
+/// </summary>
+public sealed record StudyBuddyEvalResult(
+    double JudgeScore, double AvgSteps, decimal AvgCostUsd, int N, IReadOnlyList<StudyBuddyCase> Cases);
+
+/// <summary>
+/// Runs the Phase 6 Study Buddy eval (AI-039): for each golden passage, run the REAL agent against a
+/// real edition (its tools hit the live corpus), score the final answer against the reference with the
+/// same MEAI <see cref="RubricEvaluator"/> the rest of the suite uses, and record steps + cost. A
+/// budget-exhausted run is a failed case (judge 0, steps = the cap). Persists a <c>studybuddy</c>
+/// <see cref="EvalRun"/> (judge mean 1–5; avg steps + cost in the breakdown).
+/// </summary>
+public sealed class StudyBuddyEvalRunner(ILogger<StudyBuddyEvalRunner> logger)
+{
+    private const string Feature = "studybuddy";
+
+    private static readonly Rubric Rubric = new(
+        "correctness: does the explanation accurately convey the meaning the reference explanation captures?",
+        "grounding: is it specific to THIS passage and free of invented facts (consistent with the book's domain)?",
+        "clarity: is it a clear, concise 3-5 sentence explanation a developer would find genuinely helpful?");
+
+    private static readonly ChatMessage[] JudgePlaceholderMessages = [new ChatMessage(ChatRole.User, string.Empty)];
+
+    public async Task<StudyBuddyEvalResult> RunAsync(
+        StudyBuddyAgent agent,
+        ILlmService judge,
+        string judgeModelId,
+        Guid editionId,
+        Guid? userId,
+        IServiceProvider services,
+        bool persist,
+        IAppDbContext? db,
+        string? gitSha,
+        CancellationToken ct)
+    {
+        var goldens = StudyBuddyGoldenSet.Load();
+        var chatConfig = new ChatConfiguration(new LlmServiceChatClient(judge, defaultFeatureTag: "eval.judge"));
+
+        var cases = new List<StudyBuddyCase>();
+        var scores = new List<JudgeScore>();
+        var totalSteps = 0;
+        var totalCost = 0m;
+
+        foreach (var g in goldens)
+        {
+            ct.ThrowIfCancellationRequested();
+
+            // A fresh run id per golden; the agent's tools resolve scoped services from `services`.
+            var agentCtx = new AgentContext(userId, editionId, Guid.NewGuid(), services);
+            string answer;
+            int steps;
+            decimal cost;
+            try
+            {
+                var result = await agent.RunAsync(new StudyBuddyInput(g.Passage, editionId, g.ChapterNumber), agentCtx, ct);
+                answer = result.Output;
+                steps = result.Usage.Iterations;
+                cost = result.Usage.CostUsdTotal;
+            }
+            catch (AgentBudgetExhaustedException ex)
+            {
+                // Hitting the cap is a real failure mode: score it 0 and count its (capped) steps/cost.
+                cases.Add(new StudyBuddyCase(g.Passage, ex.Usage.Iterations, ex.Usage.CostUsdTotal, 0, Completed: false));
+                scores.Add(new JudgeScore(0, 0, 0, "budget exhausted"));
+                totalSteps += ex.Usage.Iterations;
+                totalCost += ex.Usage.CostUsdTotal;
+                continue;
+            }
+
+            var evidence =
+                $"Passage:\n{g.Passage}\n\nReference explanation:\n{g.RubricAnswer}\n\nAgent's explanation:\n{answer}";
+            var evaluator = new RubricEvaluator(Feature, Rubric);
+            var evaluation = await evaluator.EvaluateAsync(
+                JudgePlaceholderMessages, new ChatResponse(new ChatMessage(ChatRole.Assistant, answer)),
+                chatConfig, [new RubricEvidenceContext(evidence)], ct);
+
+            var score = new JudgeScore(
+                ReadAxis(evaluation, Rubric.Dim1), ReadAxis(evaluation, Rubric.Dim2), ReadAxis(evaluation, Rubric.Dim3), string.Empty);
+            scores.Add(score);
+            cases.Add(new StudyBuddyCase(g.Passage, steps, cost, score.Mean, Completed: true));
+            totalSteps += steps;
+            totalCost += cost;
+        }
+
+        var n = cases.Count;
+        var judgeMean = scores.Count > 0 ? JudgeRunner.Aggregate(scores).MeanOverall : 0;
+        var avgSteps = n > 0 ? (double)totalSteps / n : 0;
+        var avgCost = n > 0 ? totalCost / n : 0m;
+
+        logger.LogInformation(
+            "Study Buddy eval edition={Edition} judge={Judge:0.00} avgSteps={Steps:0.0} avgCost=${Cost} (N={N})",
+            editionId, judgeMean, avgSteps, avgCost, n);
+
+        if (persist && db is not null)
+        {
+            db.EvalRuns.Add(new EvalRun
+            {
+                Id = Guid.NewGuid(),
+                Feature = Feature,
+                ModelId = "agent",
+                JudgeModelId = judgeModelId,
+                Score = Math.Round((decimal)judgeMean, 3),
+                N = n,
+                BreakdownJson = FormattableString.Invariant($"{{\"avgSteps\":{avgSteps:0.00},\"avgCostUsd\":{avgCost:0.0000},\"completed\":{cases.Count(c => c.Completed)}}}"),
+                GitSha = gitSha,
+                CreatedAt = DateTimeOffset.UtcNow,
+            });
+            await db.SaveChangesAsync(ct);
+        }
+
+        return new StudyBuddyEvalResult(judgeMean, avgSteps, avgCost, n, cases);
+    }
+
+    private static int ReadAxis(EvaluationResult result, string dim) =>
+        (int)Math.Round(result.Get<NumericMetric>($"{Feature}.{dim.Split(':')[0].Trim()}").Value ?? 0);
+}
diff --git a/backend/src/Ai/TextStack.Ai.EvalSuite/StudyBuddyGolden.cs b/backend/src/Ai/TextStack.Ai.EvalSuite/StudyBuddyGolden.cs
@@ -0,0 +1,14 @@
+namespace TextStack.Ai.EvalSuite;
+
+/// <summary>
+/// One Study Buddy golden case (AI-039): a confusing passage (with its chapter for context) plus a
+/// reference explanation the judge scores the agent's answer against. The passages target the eval
+/// edition (DDIA); align <see cref="ChapterNumber"/> to that edition's numbering.
+/// </summary>
+public record StudyBuddyGolden(string Passage, int? ChapterNumber, string RubricAnswer);
+
+/// <summary>Loads the embedded Study Buddy golden set (<c>studybuddy.json</c>).</summary>
+public static class StudyBuddyGoldenSet
+{
+    public static IReadOnlyList<StudyBuddyGolden> Load() => GoldenLoader.Load<StudyBuddyGolden>("studybuddy.json");
+}
diff --git a/backend/src/Api/Endpoints/AdminAiQualityEndpoints.cs b/backend/src/Api/Endpoints/AdminAiQualityEndpoints.cs
@@ -5,6 +5,7 @@
 using Microsoft.EntityFrameworkCore;
 using Microsoft.Extensions.DependencyInjection;
 using TextStack.Ai.Core;
+using Application.Agents;
 using TextStack.Ai.EvalSuite;
 using TextStack.Ai.Tools;
 
@@ -28,6 +29,60 @@ public static void MapAdminAiQualityEndpoints(this WebApplication app)
         group.MapPost("/evals/run", RunEvals);
         group.MapGet("/evals/status", GetEvalStatus);
         group.MapPost("/evals/toolcalls/run", RunToolCallEval);
+        group.MapPost("/evals/studybuddy/run", RunStudyBuddyEval);
+    }
+
+    // Phase 6 DoD gate (AI-039): runs the Study Buddy agent over the golden passages against a real
+    // edition and scores the answers + records steps/cost. Needs an embedded edition (DDIA) + a key.
+    private static async Task<IResult> RunStudyBuddyEval(
+        [FromQuery] Guid editionId,
+        [FromQuery] string? judge,
+        HttpContext httpContext,
+        IServiceProvider services,
+        IConfiguration config,
+        StudyBuddyEvalRunner runner,
+        StudyBuddyAgent agent,
+        IAppDbContext db,
+        CancellationToken ct)
+    {
+        if (editionId == Guid.Empty)
+            return Results.BadRequest(new { error = "editionId query parameter is required." });
+
+        var useOllama = string.Equals(judge, "ollama", StringComparison.OrdinalIgnoreCase);
+        var judgeKey = useOllama ? "ollama" : "openai-judge";
+        var judgeModelId = useOllama ? (config["Ollama:Model"] ?? "gemma4:e4b") : (config["Eval:JudgeModel"] ?? "gpt-4.1");
+
+        ILlmService judgeClient;
+        try
+        {
+            judgeClient = services.GetRequiredKeyedService<ILlmService>(judgeKey);
+        }
+        catch (InvalidOperationException)
+        {
+            return Results.Problem("Judge LLM is not configured.", statusCode: 503);
+        }
+
+        var gitSha = Environment.GetEnvironmentVariable("GIT_SHA");
+        // The agent's tools resolve scoped services (db, retrieval) from the request scope.
+        var result = await runner.RunAsync(
+            agent, judgeClient, judgeModelId, editionId, userId: null, httpContext.RequestServices,
+            persist: true, db, gitSha, ct);
+
+        return Results.Ok(new
+        {
+            judgeScore = Math.Round(result.JudgeScore, 3),
+            avgSteps = Math.Round(result.AvgSteps, 2),
+            avgCostUsd = result.AvgCostUsd,
+            n = result.N,
+            cases = result.Cases.Select(c => new
+            {
+                passage = c.Passage.Length > 80 ? c.Passage[..80] + "…" : c.Passage,
+                c.Steps,
+                c.CostUsd,
+                c.JudgeScore,
+                c.Completed,
+            }),
+        });
     }
 
     // Phase 5 DoD gate (AI-033): deterministic tool-call accuracy over the embedded golden set.

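A minimal sketch of how a client might build the request for the endpoint added above. The path and query parameter names come from the changelog entry. The helper name `StudyBuddyEvalRequest`, the base address, and the omission of authentication are assumptions; the diff does not show how the admin group is secured.

```csharp
using System;
using System.Net.Http;

// Hypothetical helper; only the route and query parameter names are taken from the PR.
public static class StudyBuddyEvalRequest
{
    public static HttpRequestMessage Build(Uri baseAddress, Guid editionId, string? judge = null)
    {
        // editionId is required: the handler answers 400 when it is missing or empty.
        var path = $"/admin/ai-quality/evals/studybuddy/run?editionId={editionId:D}";

        // judge=ollama selects the Ollama judge; any other value (or none) uses the default judge.
        if (!string.IsNullOrEmpty(judge))
            path += $"&judge={Uri.EscapeDataString(judge)}";

        return new HttpRequestMessage(HttpMethod.Post, new Uri(baseAddress, path));
    }
}
```

Sending the message with an `HttpClient` should return 503 when the judge LLM is not configured, and otherwise a JSON body with `judgeScore`, `avgSteps`, `avgCostUsd`, `n`, and a `cases` array, as in the handler's `Results.Ok(...)` payload.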
diff --git a/backend/src/Api/Program.cs b/backend/src/Api/Program.cs
@@ -83,6 +83,7 @@
 builder.Services.AddSingleton<TextStack.Ai.EvalSuite.EvalSuiteRunner>();
 builder.Services.AddSingleton<TextStack.Ai.EvalSuite.RagEvalRunner>();
 builder.Services.AddSingleton<TextStack.Ai.EvalSuite.ToolCallEvalRunner>();
+builder.Services.AddSingleton<TextStack.Ai.EvalSuite.StudyBuddyEvalRunner>();
 // Tool catalogue (AI-029/030): scans Application for ITool impls; dispatch is schema-validated.
 builder.Services.AddAiTools(typeof(Application.Tools.GetChapterTool).Assembly);
 // Agent loop engine (Phase 6, AI-034). Concrete agents (StudyBuddy, AI-035) build on it.

diff --git a/tests/TextStack.AiEvals/StudyBuddyEvalRunnerTests.cs b/tests/TextStack.AiEvals/StudyBuddyEvalRunnerTests.cs
@@ -0,0 +1,63 @@
+using Application.Agents;
+using Microsoft.Extensions.DependencyInjection;
+using Microsoft.Extensions.Logging.Abstractions;
+using TextStack.Ai.Agents;
+using TextStack.Ai.Core;
+using TextStack.Ai.EvalSuite;
+using TextStack.Ai.Tools;
+
+namespace TextStack.AiEvals;
+
+/// <summary>
+/// Deterministic coverage for <see cref="StudyBuddyEvalRunner"/> (AI-039): the agent runs on a fake
+/// LLM that answers directly (no tools, one iteration), the judge is fixed — so the run scores the
+/// golden set, aggregates the judge mean, and computes avg steps + cost without a key or corpus.
+/// </summary>
+public class StudyBuddyEvalRunnerTests
+{
+    private static readonly int GoldenN = StudyBuddyGoldenSet.Load().Count;
+
+    /// <summary>Answers every prompt directly (no tool calls) with a fixed cost — one iteration per run.</summary>
+    private sealed class DirectLlm : ILlmService
+    {
+        public Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken ct) =>
+            Task.FromResult(new LlmResponse("A grounded explanation of the passage.", [], new LlmUsage(40, 20, 0.003m), "agent-model", Guid.NewGuid()));
+
+        public IAsyncEnumerable<LlmDelta> StreamAsync(LlmRequest request, CancellationToken ct) =>
+            throw new NotSupportedException();
+    }
+
+    private sealed class FixedJudge(int d1, int d2, int d3) : ILlmService
+    {
+        public Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken ct) =>
+            Task.FromResult(new LlmResponse(
+                $"{{\"d1\": {d1}, \"d2\": {d2}, \"d3\": {d3}, \"rationale\": \"ok\"}}",
+                [], new LlmUsage(0, 0, 0m), "judge", Guid.NewGuid()));
+
+        public IAsyncEnumerable<LlmDelta> StreamAsync(LlmRequest request, CancellationToken ct) =>
+            throw new NotSupportedException();
+    }
+
+    private static StudyBuddyAgent Agent(ILlmService llm)
+    {
+        var registry = new ToolRegistry([]);
+        return new StudyBuddyAgent(new AgentLoop(llm, registry, new ToolDispatcher(registry)));
+    }
+
+    [Fact]
+    public async Task RunAsync_DirectAgent_ScoresGoldensAndAggregates()
+    {
+        var runner = new StudyBuddyEvalRunner(NullLogger<StudyBuddyEvalRunner>.Instance);
+
+        var result = await runner.RunAsync(
+            Agent(new DirectLlm()), new FixedJudge(5, 4, 5), "judge-test",
+            Guid.NewGuid(), userId: null, new ServiceCollection().BuildServiceProvider(),
+            persist: false, db: null, gitSha: null, TestContext.Current.CancellationToken);
+
+        Assert.Equal(GoldenN, result.N);
+        Assert.Equal((5 + 4 + 5) / 3.0, result.JudgeScore, 3); // every case scored 4.67
+        Assert.Equal(1.0, result.AvgSteps, 3);                  // direct answer → one iteration each
+        Assert.Equal(0.003m, result.AvgCostUsd);                // fixed per-run cost
+        Assert.All(result.Cases, c => Assert.True(c.Completed));
+    }
+}