diff --git a/CHANGELOG.md b/CHANGELOG.md
index 16cf3df4..d6b56a17 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,15 @@
## [Unreleased]
+### Phase 6 — Study Buddy golden set + eval — Phase 6 complete (2026-06-15)
+
+Phase 6 **AI-039** — the DoD gate for the Study Buddy agent (judge ≥4/5, avg steps ≤4, cost <$0.05). **This closes Phase 6** (agent loop → tools → persistence → SSE endpoint → reader UI → eval).
+
+- **`StudyBuddyEvalRunner`** (`Ai.EvalSuite`) — per golden: run the REAL agent against a real edition (its tools hit the live corpus), score the final answer against the reference with the same MEAI `RubricEvaluator` the suite uses (correctness / grounding / clarity), and record iterations + cost. A budget-exhausted run is a failed case (judge 0, capped steps). Persists a `studybuddy` `EvalRun` (judge mean 1–5; avg steps + cost in the breakdown).
+- **Golden set** `studybuddy.json` (embedded) — 10 starter DDIA "confusing passage → reference explanation" cases; a **starter to curate to the DoD's 25** against the live edition (align chapter numbers to it).
+- **`POST /admin/ai-quality/evals/studybuddy/run?editionId=&judge=`** — admin-triggered against a real embedded edition (503 when the judge isn't configured). The agent's tools resolve scoped services from the request scope.
+- Tests: `StudyBuddyEvalRunner` with a direct-answering fake agent + fixed judge — scores the whole golden set, aggregates the judge mean, and computes avg steps (1.0) + cost deterministically. Golden count read from the dataset so it survives growth to 25.
+
### Phase 6 — Study Buddy wired into the reader (2026-06-15)
Phase 6 **AI-038, slice b** — the panel is now reachable: select a passage in the reader → "Help me understand this" → the agent investigates live.
diff --git a/backend/src/Ai/TextStack.Ai.EvalSuite/Datasets/studybuddy.json b/backend/src/Ai/TextStack.Ai.EvalSuite/Datasets/studybuddy.json
new file mode 100644
index 00000000..9d6a6b5f
--- /dev/null
+++ b/backend/src/Ai/TextStack.Ai.EvalSuite/Datasets/studybuddy.json
@@ -0,0 +1,52 @@
+[
+ {
+ "passage": "The CAP theorem is sometimes stated as \"Consistency, Availability, Partition tolerance: pick two out of three.\" Unfortunately, putting it this way is misleading.",
+ "chapterNumber": 9,
+ "rubricAnswer": "Partitions aren't something you choose — they happen. So the real choice is only between consistency (linearizability) and availability, and only WHEN a network partition is occurring. The popular '2 of 3' phrasing is misleading because partition tolerance isn't optional."
+ },
+ {
+ "passage": "In a leaderless replication system, the client sends each write to several replicas in parallel, and read requests are also sent to several nodes in parallel.",
+ "chapterNumber": 5,
+ "rubricAnswer": "Leaderless (Dynamo-style) replication has no single leader; the client (or a coordinator) writes to and reads from several replicas at once. Quorum overlap (w + r > n) is what gives a good chance of reading an up-to-date value despite no leader ordering the writes."
+ },
+ {
+ "passage": "An advantage of append-only log-structured storage is that segment files are immutable, so concurrent reads do not need to take locks on them.",
+ "chapterNumber": 3,
+ "rubricAnswer": "Because segments are only ever appended to and never modified in place, a reader can scan a segment while writes go elsewhere — no read/write locking needed. Immutability also simplifies crash recovery, since a partially written record is just ignored."
+ },
+ {
+ "passage": "Serializable snapshot isolation (SSI) is an optimistic concurrency control technique.",
+ "chapterNumber": 7,
+ "rubricAnswer": "Unlike pessimistic two-phase locking, which blocks to prevent conflicts, SSI lets transactions run on a snapshot and only checks at commit time whether they actually conflicted; if so, one is aborted and retried. It is optimistic because it assumes conflicts are rare."
+ },
+ {
+ "passage": "A node in the network cannot reliably tell whether another node has crashed or is merely slow to respond.",
+ "chapterNumber": 8,
+ "rubricAnswer": "On an asynchronous network there is no way to distinguish a dead node from a slow one — a missing reply could mean the node crashed, the request was lost, or it's just delayed. Systems use timeouts to guess, but a timeout is only a heuristic, not a certainty."
+ },
+ {
+ "passage": "Linearizability is a recency guarantee: it makes a system appear as though there is only a single copy of the data.",
+ "chapterNumber": 9,
+ "rubricAnswer": "Linearizability means that once a write completes, every later read sees that value (or a newer one) — the replicated system behaves as if there were one copy updated atomically. It's a strong consistency guarantee, stronger than eventual consistency."
+ },
+ {
+ "passage": "Skewed workloads can cause hot spots, where one partition receives disproportionately more load than the others.",
+ "chapterNumber": 6,
+ "rubricAnswer": "If keys aren't spread evenly (e.g. a celebrity user, or many records sharing a partition key), one partition gets far more traffic than the rest — a hot spot — which limits scalability because that node becomes the bottleneck. Hashing keys or adding a random prefix helps spread the load."
+ },
+ {
+ "passage": "Derived data, such as a search index or a cache, can always be recreated from the underlying source data.",
+ "chapterNumber": 11,
+ "rubricAnswer": "Source-of-truth (system of record) data is authoritative; derived data (indexes, caches, materialized views) is a transformation of it and is redundant — if lost or corrupted, it can be rebuilt by re-running the transformation over the source. That's why derived data can be treated as disposable."
+ },
+ {
+ "passage": "With event sourcing, application state is determined by a log of immutable events that are only ever appended.",
+ "chapterNumber": 11,
+ "rubricAnswer": "Instead of storing only the current state, event sourcing records every change as an immutable event in an append-only log; the current state is derived by replaying those events. This keeps a full history and lets you rebuild new views, at the cost of having to fold the log into state."
+ },
+ {
+ "passage": "The danger of two-phase locking is that transactions can deadlock, each waiting for a lock the other holds.",
+ "chapterNumber": 7,
+ "rubricAnswer": "Two-phase locking (2PL) makes transactions acquire locks and hold them until commit; if transaction A holds a lock B wants and B holds a lock A wants, neither can proceed — a deadlock. The database detects the cycle and aborts one transaction so the other can continue."
+ }
+]
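The changelog asks for the starter set to be curated to 25 cases with chapter numbers aligned to the live edition. A small sanity check over the dataset can catch empty fields before an eval run spends tokens on them. The following is a minimal sketch, not part of this change: the inline sample, the `HasText` helper, and the "chapter number must be positive" rule are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Inline sample in the same shape as studybuddy.json (contents are illustrative only).
const string json = """
[
  { "passage": "A node cannot tell a crashed peer from a slow one.", "chapterNumber": 8, "rubricAnswer": "Timeouts are a heuristic." },
  { "passage": "", "chapterNumber": 0, "rubricAnswer": "Missing passage and a bad chapter." }
]
""";

var problems = new List<string>();
using var doc = JsonDocument.Parse(json);
var index = 0;
foreach (var item in doc.RootElement.EnumerateArray())
{
    if (!HasText(item, "passage")) problems.Add($"case {index}: empty passage");
    if (!HasText(item, "rubricAnswer")) problems.Add($"case {index}: empty rubricAnswer");

    // chapterNumber is optional (int? on StudyBuddyGolden); the positivity rule is an assumption.
    if (item.TryGetProperty("chapterNumber", out var chapter)
        && chapter.ValueKind == JsonValueKind.Number
        && chapter.TryGetInt32(out var number) && number < 1)
    {
        problems.Add($"case {index}: chapterNumber {number} is not positive");
    }

    index++;
}

Console.WriteLine(problems.Count == 0 ? "golden set OK" : string.Join(Environment.NewLine, problems));

// Hypothetical helper: true when the property exists and holds a non-blank string.
static bool HasText(JsonElement item, string name) =>
    item.TryGetProperty(name, out var value)
    && value.ValueKind == JsonValueKind.String
    && !string.IsNullOrWhiteSpace(value.GetString());
```

With the sample above, the second case is reported twice (empty passage, non-positive chapter number) and the first passes.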
diff --git a/backend/src/Ai/TextStack.Ai.EvalSuite/StudyBuddyEvalRunner.cs b/backend/src/Ai/TextStack.Ai.EvalSuite/StudyBuddyEvalRunner.cs
new file mode 100644
index 00000000..480ec453
--- /dev/null
+++ b/backend/src/Ai/TextStack.Ai.EvalSuite/StudyBuddyEvalRunner.cs
@@ -0,0 +1,134 @@
+using Application.Agents;
+using Application.Common.Interfaces;
+using Domain.Entities;
+using Microsoft.Extensions.AI;
+using Microsoft.Extensions.AI.Evaluation;
+using Microsoft.Extensions.Logging;
+using TextStack.Ai.Core;
+using TextStack.Ai.Evals;
+using TextStack.Ai.Llm;
+
+namespace TextStack.Ai.EvalSuite;
+
+/// <summary>One golden's outcome, for the admin UI.</summary>
+public sealed record StudyBuddyCase(string Passage, int Steps, decimal CostUsd, double JudgeScore, bool Completed);
+
+/// <summary>
+/// Result of a Study Buddy eval run (AI-039) with the Phase 6 DoD numbers: the judge mean (≥4/5 target),
+/// average iterations per run (≤4 target: a planner that's neither lazy nor over-eager) and average
+/// cost (under $0.05 target).
+/// </summary>
+public sealed record StudyBuddyEvalResult(
+ double JudgeScore, double AvgSteps, decimal AvgCostUsd, int N, IReadOnlyList<StudyBuddyCase> Cases);
+
+/// <summary>
+/// Runs the Phase 6 Study Buddy eval (AI-039): for each golden passage, run the REAL agent against a
+/// real edition (its tools hit the live corpus), score the final answer against the reference with the
+/// same MEAI <see cref="RubricEvaluator"/> the rest of the suite uses, and record steps + cost. A
+/// budget-exhausted run is a failed case (judge 0, steps = the cap). Persists a <c>studybuddy</c>
+/// <see cref="EvalRun"/> (judge mean 1–5; avg steps + cost in the breakdown).
+/// </summary>
+public sealed class StudyBuddyEvalRunner(ILogger<StudyBuddyEvalRunner> logger)
+{
+ private const string Feature = "studybuddy";
+
+ private static readonly Rubric Rubric = new(
+ "correctness: does the explanation accurately convey the meaning the reference explanation captures?",
+ "grounding: is it specific to THIS passage and free of invented facts (consistent with the book's domain)?",
+ "clarity: is it a clear, concise 3-5 sentence explanation a developer would find genuinely helpful?");
+
+ private static readonly ChatMessage[] JudgePlaceholderMessages = [new ChatMessage(ChatRole.User, string.Empty)];
+
+ public async Task<StudyBuddyEvalResult> RunAsync(
+ StudyBuddyAgent agent,
+ ILlmService judge,
+ string judgeModelId,
+ Guid editionId,
+ Guid? userId,
+ IServiceProvider services,
+ bool persist,
+ IAppDbContext? db,
+ string? gitSha,
+ CancellationToken ct)
+ {
+ var goldens = StudyBuddyGoldenSet.Load();
+ var chatConfig = new ChatConfiguration(new LlmServiceChatClient(judge, defaultFeatureTag: "eval.judge"));
+
+ var cases = new List<StudyBuddyCase>();
+ var scores = new List<JudgeScore>();
+ var totalSteps = 0;
+ var totalCost = 0m;
+
+ foreach (var g in goldens)
+ {
+ ct.ThrowIfCancellationRequested();
+
+ // A fresh run id per golden; the agent's tools resolve scoped services from `services`.
+ var agentCtx = new AgentContext(userId, editionId, Guid.NewGuid(), services);
+ string answer;
+ int steps;
+ decimal cost;
+ try
+ {
+ var result = await agent.RunAsync(new StudyBuddyInput(g.Passage, editionId, g.ChapterNumber), agentCtx, ct);
+ answer = result.Output;
+ steps = result.Usage.Iterations;
+ cost = result.Usage.CostUsdTotal;
+ }
+ catch (AgentBudgetExhaustedException ex)
+ {
+ // Hitting the cap is a real failure mode: score it 0 and count its (capped) steps/cost.
+ cases.Add(new StudyBuddyCase(g.Passage, ex.Usage.Iterations, ex.Usage.CostUsdTotal, 0, Completed: false));
+ scores.Add(new JudgeScore(0, 0, 0, "budget exhausted"));
+ totalSteps += ex.Usage.Iterations;
+ totalCost += ex.Usage.CostUsdTotal;
+ continue;
+ }
+
+ var evidence =
+ $"Passage:\n{g.Passage}\n\nReference explanation:\n{g.RubricAnswer}\n\nAgent's explanation:\n{answer}";
+ var evaluator = new RubricEvaluator(Feature, Rubric);
+ var evaluation = await evaluator.EvaluateAsync(
+ JudgePlaceholderMessages, new ChatResponse(new ChatMessage(ChatRole.Assistant, answer)),
+ chatConfig, [new RubricEvidenceContext(evidence)], ct);
+
+ var score = new JudgeScore(
+ ReadAxis(evaluation, Rubric.Dim1), ReadAxis(evaluation, Rubric.Dim2), ReadAxis(evaluation, Rubric.Dim3), string.Empty);
+ scores.Add(score);
+ cases.Add(new StudyBuddyCase(g.Passage, steps, cost, score.Mean, Completed: true));
+ totalSteps += steps;
+ totalCost += cost;
+ }
+
+ var n = cases.Count;
+ var judgeMean = scores.Count > 0 ? JudgeRunner.Aggregate(scores).MeanOverall : 0;
+ var avgSteps = n > 0 ? (double)totalSteps / n : 0;
+ var avgCost = n > 0 ? totalCost / n : 0m;
+
+ logger.LogInformation(
+ "Study Buddy eval edition={Edition} judge={Judge:0.00} avgSteps={Steps:0.0} avgCost=${Cost} (N={N})",
+ editionId, judgeMean, avgSteps, avgCost, n);
+
+ if (persist && db is not null)
+ {
+ db.EvalRuns.Add(new EvalRun
+ {
+ Id = Guid.NewGuid(),
+ Feature = Feature,
+ ModelId = "agent",
+ JudgeModelId = judgeModelId,
+ Score = Math.Round((decimal)judgeMean, 3),
+ N = n,
+ BreakdownJson = FormattableString.Invariant($"{{\"avgSteps\":{avgSteps:0.00},\"avgCostUsd\":{avgCost:0.0000},\"completed\":{cases.Count(c => c.Completed)}}}"),
+ GitSha = gitSha,
+ CreatedAt = DateTimeOffset.UtcNow,
+ });
+ await db.SaveChangesAsync(ct);
+ }
+
+ return new StudyBuddyEvalResult(judgeMean, avgSteps, avgCost, n, cases);
+ }
+
+ private static int ReadAxis(EvaluationResult result, string dim) =>
+ (int)Math.Round(result.Get<NumericMetric>($"{Feature}.{dim.Split(':')[0].Trim()}").Value ?? 0);
+}
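The runner assembles `BreakdownJson` by hand with numeric format specifiers, whose output depends on the current culture unless it is formatted invariantly. As a hedged alternative, the same shape could be produced with `System.Text.Json`, which writes numbers according to the JSON grammar. This is a sketch only: the three local values are placeholders standing in for the runner's aggregates, and the anonymous object is not a type from this change.

```csharp
using System;
using System.Text.Json;

// Placeholder aggregates (assumed values, standing in for the runner's locals).
double avgSteps = 1.0;
decimal avgCostUsd = 0.003m;
int completed = 10;

// Round first, then let the serializer emit culture-independent JSON numbers.
string json = JsonSerializer.Serialize(new
{
    avgSteps = Math.Round(avgSteps, 2),
    avgCostUsd = Math.Round(avgCostUsd, 4),
    completed
});

Console.WriteLine(json);
```

The trade-off is that the serializer does not pad to a fixed number of decimals, so consumers should parse the values as numbers rather than rely on their textual width.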
diff --git a/backend/src/Ai/TextStack.Ai.EvalSuite/StudyBuddyGolden.cs b/backend/src/Ai/TextStack.Ai.EvalSuite/StudyBuddyGolden.cs
new file mode 100644
index 00000000..d67ef2a4
--- /dev/null
+++ b/backend/src/Ai/TextStack.Ai.EvalSuite/StudyBuddyGolden.cs
@@ -0,0 +1,14 @@
+namespace TextStack.Ai.EvalSuite;
+
+/// <summary>
+/// One Study Buddy golden case (AI-039): a confusing passage (with its chapter for context) plus a
+/// reference explanation the judge scores the agent's answer against. The passages target the eval
+/// edition (DDIA); align chapter numbers to that edition's numbering.
+/// </summary>
+public record StudyBuddyGolden(string Passage, int? ChapterNumber, string RubricAnswer);
+
+/// <summary>Loads the embedded Study Buddy golden set (<c>studybuddy.json</c>).</summary>
+public static class StudyBuddyGoldenSet
+{
+ public static IReadOnlyList<StudyBuddyGolden> Load() => GoldenLoader.Load<StudyBuddyGolden>("studybuddy.json");
+}
diff --git a/backend/src/Api/Endpoints/AdminAiQualityEndpoints.cs b/backend/src/Api/Endpoints/AdminAiQualityEndpoints.cs
index 1b0d2824..6826c18e 100644
--- a/backend/src/Api/Endpoints/AdminAiQualityEndpoints.cs
+++ b/backend/src/Api/Endpoints/AdminAiQualityEndpoints.cs
@@ -5,6 +5,7 @@
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;
using TextStack.Ai.Core;
+using Application.Agents;
using TextStack.Ai.EvalSuite;
using TextStack.Ai.Tools;
@@ -28,6 +29,60 @@ public static void MapAdminAiQualityEndpoints(this WebApplication app)
group.MapPost("/evals/run", RunEvals);
group.MapGet("/evals/status", GetEvalStatus);
group.MapPost("/evals/toolcalls/run", RunToolCallEval);
+ group.MapPost("/evals/studybuddy/run", RunStudyBuddyEval);
+ }
+
+ // Phase 6 DoD gate (AI-039): runs the Study Buddy agent over the golden passages against a real
+ // edition and scores the answers + records steps/cost. Needs an embedded edition (DDIA) + a key.
+ private static async Task<IResult> RunStudyBuddyEval(
+ [FromQuery] Guid editionId,
+ [FromQuery] string? judge,
+ HttpContext httpContext,
+ IServiceProvider services,
+ IConfiguration config,
+ StudyBuddyEvalRunner runner,
+ StudyBuddyAgent agent,
+ IAppDbContext db,
+ CancellationToken ct)
+ {
+ if (editionId == Guid.Empty)
+ return Results.BadRequest(new { error = "editionId query parameter is required." });
+
+ var useOllama = string.Equals(judge, "ollama", StringComparison.OrdinalIgnoreCase);
+ var judgeKey = useOllama ? "ollama" : "openai-judge";
+ var judgeModelId = useOllama ? config["Ollama:Model"] ?? "gemma4:e4b" : config["Eval:JudgeModel"] ?? "gpt-4.1";
+
+ ILlmService judgeClient;
+ try
+ {
+ judgeClient = services.GetRequiredKeyedService<ILlmService>(judgeKey);
+ }
+ catch (InvalidOperationException)
+ {
+ return Results.Problem("Judge LLM is not configured.", statusCode: 503);
+ }
+
+ var gitSha = Environment.GetEnvironmentVariable("GIT_SHA");
+ // The agent's tools resolve scoped services (db, retrieval) from the request scope.
+ var result = await runner.RunAsync(
+ agent, judgeClient, judgeModelId, editionId, userId: null, httpContext.RequestServices,
+ persist: true, db, gitSha, ct);
+
+ return Results.Ok(new
+ {
+ judgeScore = Math.Round(result.JudgeScore, 3),
+ avgSteps = Math.Round(result.AvgSteps, 2),
+ avgCostUsd = result.AvgCostUsd,
+ n = result.N,
+ cases = result.Cases.Select(c => new
+ {
+ passage = c.Passage.Length > 80 ? c.Passage[..80] + "…" : c.Passage,
+ c.Steps,
+ c.CostUsd,
+ c.JudgeScore,
+ c.Completed,
+ }),
+ });
}
// Phase 5 DoD gate (AI-033): deterministic tool-call accuracy over the embedded golden set.
diff --git a/backend/src/Api/Program.cs b/backend/src/Api/Program.cs
index 1f6ff86e..4042f84d 100644
--- a/backend/src/Api/Program.cs
+++ b/backend/src/Api/Program.cs
@@ -83,6 +83,7 @@
builder.Services.AddSingleton();
builder.Services.AddSingleton();
builder.Services.AddSingleton();
+builder.Services.AddSingleton<StudyBuddyEvalRunner>();
// Tool catalogue (AI-029/030): scans Application for ITool impls; dispatch is schema-validated.
builder.Services.AddAiTools(typeof(Application.Tools.GetChapterTool).Assembly);
// Agent loop engine (Phase 6, AI-034). Concrete agents (StudyBuddy, AI-035) build on it.
diff --git a/tests/TextStack.AiEvals/StudyBuddyEvalRunnerTests.cs b/tests/TextStack.AiEvals/StudyBuddyEvalRunnerTests.cs
new file mode 100644
index 00000000..c6dc2221
--- /dev/null
+++ b/tests/TextStack.AiEvals/StudyBuddyEvalRunnerTests.cs
@@ -0,0 +1,63 @@
+using Application.Agents;
+using Microsoft.Extensions.DependencyInjection;
+using Microsoft.Extensions.Logging.Abstractions;
+using TextStack.Ai.Agents;
+using TextStack.Ai.Core;
+using TextStack.Ai.EvalSuite;
+using TextStack.Ai.Tools;
+
+namespace TextStack.AiEvals;
+
+/// <summary>
+/// Deterministic coverage for <see cref="StudyBuddyEvalRunner"/> (AI-039): the agent runs on a fake
+/// LLM that answers directly (no tools, one iteration) and the judge is fixed, so the run scores the
+/// golden set, aggregates the judge mean, and computes avg steps + cost without a key or corpus.
+/// </summary>
+public class StudyBuddyEvalRunnerTests
+{
+ private static readonly int GoldenN = StudyBuddyGoldenSet.Load().Count;
+
+ /// <summary>Answers every prompt directly (no tool calls) with a fixed cost: one iteration per run.</summary>
+ private sealed class DirectLlm : ILlmService
+ {
+ public Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken ct) =>
+ Task.FromResult(new LlmResponse("A grounded explanation of the passage.", [], new LlmUsage(40, 20, 0.003m), "agent-model", Guid.NewGuid()));
+
+ public IAsyncEnumerable StreamAsync(LlmRequest request, CancellationToken ct) =>
+ throw new NotSupportedException();
+ }
+
+ private sealed class FixedJudge(int d1, int d2, int d3) : ILlmService
+ {
+ public Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken ct) =>
+ Task.FromResult(new LlmResponse(
+ $"{{\"d1\": {d1}, \"d2\": {d2}, \"d3\": {d3}, \"rationale\": \"ok\"}}",
+ [], new LlmUsage(0, 0, 0m), "judge", Guid.NewGuid()));
+
+ public IAsyncEnumerable StreamAsync(LlmRequest request, CancellationToken ct) =>
+ throw new NotSupportedException();
+ }
+
+ private static StudyBuddyAgent Agent(ILlmService llm)
+ {
+ var registry = new ToolRegistry([]);
+ return new StudyBuddyAgent(new AgentLoop(llm, registry, new ToolDispatcher(registry)));
+ }
+
+ [Fact]
+ public async Task RunAsync_DirectAgent_ScoresGoldensAndAggregates()
+ {
+ var runner = new StudyBuddyEvalRunner(NullLogger<StudyBuddyEvalRunner>.Instance);
+
+ var result = await runner.RunAsync(
+ Agent(new DirectLlm()), new FixedJudge(5, 4, 5), "judge-test",
+ Guid.NewGuid(), userId: null, new ServiceCollection().BuildServiceProvider(),
+ persist: false, db: null, gitSha: null, TestContext.Current.CancellationToken);
+
+ Assert.Equal(GoldenN, result.N);
+ Assert.Equal((5 + 4 + 5) / 3.0, result.JudgeScore, 3); // every case scored 4.67
+ Assert.Equal(1.0, result.AvgSteps, 3); // direct answer → one iteration each
+ Assert.Equal(0.003m, result.AvgCostUsd); // fixed per-run cost
+ Assert.All(result.Cases, c => Assert.True(c.Completed));
+ }
+}
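The test above exercises only the all-completed path. To illustrate how a budget-exhausted case feeds into the three DoD numbers, here is a minimal standalone sketch. It assumes the overall judge score is a plain mean of per-case means (the actual `JudgeRunner.Aggregate` may weight things differently), and the case tuples and their values are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-case outcomes: (steps, cost in USD, judge mean, completed).
var cases = new List<(int Steps, decimal CostUsd, double JudgeMean, bool Completed)>
{
    (2, 0.010m, 4.0, true),
    (3, 0.020m, 5.0, true),
    (6, 0.060m, 0.0, false), // budget exhausted: scored 0, steps and cost still counted
};

var n = cases.Count;
var judgeMean = n > 0 ? cases.Average(c => c.JudgeMean) : 0;
var avgSteps = n > 0 ? cases.Average(c => c.Steps) : 0;
var avgCost = n > 0 ? cases.Sum(c => c.CostUsd) / n : 0m;

// DoD gate from the changelog: judge >= 4/5, avg steps <= 4, avg cost < $0.05.
bool passes = judgeMean >= 4.0 && avgSteps <= 4.0 && avgCost < 0.05m;

Console.WriteLine($"judge={judgeMean} avgSteps={avgSteps} avgCost={avgCost} pass={passes}");
```

In this sample the single exhausted run drags the judge mean down to 3.0, so the gate fails even though average steps and cost stay within their targets. That is the intended effect of scoring a capped run as 0 rather than dropping it from the set.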