9 changes: 9 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -2,6 +2,15 @@

## [Unreleased]

### Phase 6 — Study Buddy golden set + eval — Phase 6 complete (2026-06-15)

Phase 6 **AI-039** — the DoD gate for the Study Buddy agent (judge ≥4/5, avg steps ≤4, cost <$0.05). **This closes Phase 6** (agent loop → tools → persistence → SSE endpoint → reader UI → eval).

- **`StudyBuddyEvalRunner`** (`Ai.EvalSuite`) — per golden: run the REAL agent against a real edition (its tools hit the live corpus), score the final answer against the reference with the same MEAI `RubricEvaluator` the suite uses (correctness / grounding / clarity), and record iterations + cost. A budget-exhausted run is a failed case (judge 0, capped steps). Persists a `studybuddy` `EvalRun` (judge mean 1–5; avg steps + cost in the breakdown).
- **Golden set** `studybuddy.json` (embedded): 10 starter DDIA "confusing passage → reference explanation" cases, **to be curated up to the DoD's 25** against the live edition (align chapter numbers to it).
- **`POST /admin/ai-quality/evals/studybuddy/run?editionId=&judge=`** — admin-triggered against a real embedded edition (503 when the judge isn't configured). The agent's tools resolve scoped services from the request scope.
- Tests: `StudyBuddyEvalRunner` with a direct-answering fake agent + fixed judge — scores the whole golden set, aggregates the judge mean, and computes avg steps (1.0) + cost deterministically. Golden count read from the dataset so it survives growth to 25.
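
The three DoD thresholds above can be read as a simple pass/fail gate. A minimal sketch, assuming the comparisons are inclusive for the judge mean and the step average and strict for cost, as the ≥, ≤ and < in the entry suggest. `DodGate` and `Passes` are hypothetical names and are not part of this PR:

```csharp
using System;

// Hypothetical helper (not in the PR): applies the Phase 6 DoD thresholds
// quoted in the changelog to the three aggregate numbers of an eval run.
public static class DodGate
{
    public const double MinJudgeMean = 4.0;      // judge >= 4/5
    public const double MaxAvgSteps = 4.0;       // avg steps <= 4
    public const decimal MaxAvgCostUsd = 0.05m;  // cost < $0.05 (strict)

    public static bool Passes(double judgeMean, double avgSteps, decimal avgCostUsd) =>
        judgeMean >= MinJudgeMean && avgSteps <= MaxAvgSteps && avgCostUsd < MaxAvgCostUsd;
}

public static class Program
{
    public static void Main()
    {
        // Judge mean (5+4+5)/3, one step, $0.003 per run: all three thresholds hold.
        Console.WriteLine(DodGate.Passes((5 + 4 + 5) / 3.0, 1.0, 0.003m)); // True
        // Judge mean below 4 fails the gate.
        Console.WriteLine(DodGate.Passes(3.9, 1.0, 0.003m));               // False
        // Cost exactly at $0.05 fails because the bound is strict.
        Console.WriteLine(DodGate.Passes(4.5, 2.0, 0.05m));                // False
    }
}
```

Keeping the thresholds as named constants makes it easy to adjust them if the DoD wording turns out to mean something slightly different (for example an inclusive cost bound).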

### Phase 6 — Study Buddy wired into the reader (2026-06-15)

Phase 6 **AI-038, slice b** — the panel is now reachable: select a passage in the reader → "Help me understand this" → the agent investigates live.
52 changes: 52 additions & 0 deletions backend/src/Ai/TextStack.Ai.EvalSuite/Datasets/studybuddy.json
@@ -0,0 +1,52 @@
[
{
"passage": "The CAP theorem is sometimes stated as \"Consistency, Availability, Partition tolerance: pick two out of three.\" Unfortunately, putting it this way is misleading.",
"chapterNumber": 9,
"rubricAnswer": "Partitions aren't something you choose — they happen. So the real choice is only between consistency (linearizability) and availability, and only WHEN a network partition is occurring. The popular '2 of 3' phrasing is misleading because partition tolerance isn't optional."
},
{
"passage": "In a leaderless replication system, the client sends each write to several replicas in parallel, and read requests are also sent to several nodes in parallel.",
"chapterNumber": 5,
"rubricAnswer": "Leaderless (Dynamo-style) replication has no single leader; the client (or a coordinator) writes to and reads from several replicas at once. Quorum overlap (w + r > n) is what gives a good chance of reading an up-to-date value despite no leader ordering the writes."
},
{
"passage": "An advantage of append-only log-structured storage is that segment files are immutable, so concurrent reads do not need to take locks on them.",
"chapterNumber": 3,
"rubricAnswer": "Because segments are only ever appended to and never modified in place, a reader can scan a segment while writes go elsewhere — no read/write locking needed. Immutability also simplifies crash recovery, since a partially written record is just ignored."
},
{
"passage": "Serializable snapshot isolation (SSI) is an optimistic concurrency control technique.",
"chapterNumber": 7,
"rubricAnswer": "Unlike pessimistic two-phase locking, which blocks to prevent conflicts, SSI lets transactions run on a snapshot and only checks at commit time whether they actually conflicted; if so, one is aborted and retried. It is optimistic because it assumes conflicts are rare."
},
{
"passage": "A node in the network cannot reliably tell whether another node has crashed or is merely slow to respond.",
"chapterNumber": 8,
"rubricAnswer": "On an asynchronous network there is no way to distinguish a dead node from a slow one — a missing reply could mean the node crashed, the request was lost, or it's just delayed. Systems use timeouts to guess, but a timeout is only a heuristic, not a certainty."
},
{
"passage": "Linearizability is a recency guarantee: it makes a system appear as though there is only a single copy of the data.",
"chapterNumber": 9,
"rubricAnswer": "Linearizability means that once a write completes, every later read sees that value (or a newer one) — the replicated system behaves as if there were one copy updated atomically. It's a strong consistency guarantee, stronger than eventual consistency."
},
{
"passage": "Skewed workloads can cause hot spots, where one partition receives disproportionately more load than the others.",
"chapterNumber": 6,
"rubricAnswer": "If keys aren't spread evenly (e.g. a celebrity user, or many records sharing a partition key), one partition gets far more traffic than the rest — a hot spot — which limits scalability because that node becomes the bottleneck. Hashing keys or adding a random prefix helps spread the load."
},
{
"passage": "Derived data, such as a search index or a cache, can always be recreated from the underlying source data.",
"chapterNumber": 11,
"rubricAnswer": "Source-of-truth (system of record) data is authoritative; derived data (indexes, caches, materialized views) is a transformation of it and is redundant — if lost or corrupted, it can be rebuilt by re-running the transformation over the source. That's why derived data can be treated as disposable."
},
{
"passage": "With event sourcing, application state is determined by a log of immutable events that are only ever appended.",
"chapterNumber": 11,
"rubricAnswer": "Instead of storing only the current state, event sourcing records every change as an immutable event in an append-only log; the current state is derived by replaying those events. This keeps a full history and lets you rebuild new views, at the cost of having to fold the log into state."
},
{
"passage": "The danger of two-phase locking is that transactions can deadlock, each waiting for a lock the other holds.",
"chapterNumber": 7,
"rubricAnswer": "Two-phase locking (2PL) makes transactions acquire locks and hold them until commit; if transaction A holds a lock B wants and B holds a lock A wants, neither can proceed — a deadlock. The database detects the cycle and aborts one transaction so the other can continue."
}
]
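
Since the chapter numbers still have to be aligned to the live edition, a small sanity check over the file could catch empty fields or out-of-range chapters before a run. A sketch, assuming the JSON shape shown above. `GoldenCase`, `GoldenCheck.Validate` and the `maxChapter` bound are hypothetical and not part of this PR; 12 matches the chapter count of DDIA's first edition but should be set from the edition actually used:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical mirror of one entry in studybuddy.json (not the PR's StudyBuddyGolden type).
public sealed record GoldenCase(string Passage, int? ChapterNumber, string RubricAnswer);

public static class GoldenCheck
{
    // The file uses camelCase keys, so match property names case-insensitively.
    private static readonly JsonSerializerOptions Options = new() { PropertyNameCaseInsensitive = true };

    // Returns human-readable problems; an empty list means the set looks sane.
    public static List<string> Validate(string json, int maxChapter)
    {
        var problems = new List<string>();
        var cases = JsonSerializer.Deserialize<List<GoldenCase>>(json, Options) ?? new List<GoldenCase>();
        for (var i = 0; i < cases.Count; i++)
        {
            var c = cases[i];
            if (string.IsNullOrWhiteSpace(c.Passage)) problems.Add($"case {i}: empty passage");
            if (string.IsNullOrWhiteSpace(c.RubricAnswer)) problems.Add($"case {i}: empty rubricAnswer");
            // A missing chapter is allowed (nullable); a present one must be in range.
            if (c.ChapterNumber is int ch && (ch < 1 || ch > maxChapter))
                problems.Add($"case {i}: chapterNumber {ch} outside 1..{maxChapter}");
        }
        return problems;
    }
}

public static class Program
{
    public static void Main()
    {
        const string sample = """
            [
              { "passage": "p", "chapterNumber": 9, "rubricAnswer": "r" },
              { "passage": "", "chapterNumber": 42, "rubricAnswer": "r" }
            ]
            """;
        foreach (var problem in GoldenCheck.Validate(sample, maxChapter: 12))
            Console.WriteLine(problem);
    }
}
```

A check like this could run as a unit test next to the golden-count assertion, so it keeps working as the set grows toward 25 cases.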
134 changes: 134 additions & 0 deletions backend/src/Ai/TextStack.Ai.EvalSuite/StudyBuddyEvalRunner.cs
@@ -0,0 +1,134 @@
using Application.Agents;
using Application.Common.Interfaces;
using Domain.Entities;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.Logging;
using TextStack.Ai.Core;
using TextStack.Ai.Evals;
using TextStack.Ai.Llm;

namespace TextStack.Ai.EvalSuite;

/// <summary>One golden's outcome — for the admin UI.</summary>
public sealed record StudyBuddyCase(string Passage, int Steps, decimal CostUsd, double JudgeScore, bool Completed);

/// <summary>
/// Result of a Study Buddy eval run (AI-039) — the Phase 6 DoD numbers: the judge mean (≥4/5 target),
/// average iterations per run (≤4 target: a planner that's neither lazy nor over-eager) and average
/// cost (&lt;$0.05 target).
/// </summary>
public sealed record StudyBuddyEvalResult(
double JudgeScore, double AvgSteps, decimal AvgCostUsd, int N, IReadOnlyList<StudyBuddyCase> Cases);

/// <summary>
/// Runs the Phase 6 Study Buddy eval (AI-039): for each golden passage, run the REAL agent against a
/// real edition (its tools hit the live corpus), score the final answer against the reference with the
/// same MEAI <see cref="RubricEvaluator"/> the rest of the suite uses, and record steps + cost. A
/// budget-exhausted run is a failed case (judge 0, steps = the cap). Persists a <c>studybuddy</c>
/// <see cref="EvalRun"/> (judge mean 1–5; avg steps + cost in the breakdown).
/// </summary>
public sealed class StudyBuddyEvalRunner(ILogger<StudyBuddyEvalRunner> logger)
{
private const string Feature = "studybuddy";

private static readonly Rubric Rubric = new(
"correctness: does the explanation accurately convey the meaning the reference explanation captures?",
"grounding: is it specific to THIS passage and free of invented facts (consistent with the book's domain)?",
"clarity: is it a clear, concise 3-5 sentence explanation a developer would find genuinely helpful?");

private static readonly ChatMessage[] JudgePlaceholderMessages = [new ChatMessage(ChatRole.User, string.Empty)];

public async Task<StudyBuddyEvalResult> RunAsync(
StudyBuddyAgent agent,
ILlmService judge,
string judgeModelId,
Guid editionId,
Guid? userId,
IServiceProvider services,
bool persist,
IAppDbContext? db,
string? gitSha,
CancellationToken ct)
{
var goldens = StudyBuddyGoldenSet.Load();
var chatConfig = new ChatConfiguration(new LlmServiceChatClient(judge, defaultFeatureTag: "eval.judge"));

var cases = new List<StudyBuddyCase>();
var scores = new List<JudgeScore>();
var totalSteps = 0;
var totalCost = 0m;

foreach (var g in goldens)
{
ct.ThrowIfCancellationRequested();

// A fresh run id per golden; the agent's tools resolve scoped services from `services`.
var agentCtx = new AgentContext(userId, editionId, Guid.NewGuid(), services);
string answer;
int steps;
decimal cost;
try
{
var result = await agent.RunAsync(new StudyBuddyInput(g.Passage, editionId, g.ChapterNumber), agentCtx, ct);
answer = result.Output;
steps = result.Usage.Iterations;
cost = result.Usage.CostUsdTotal;
}
catch (AgentBudgetExhaustedException ex)
{
// Hitting the cap is a real failure mode: score it 0 and count its (capped) steps/cost.
cases.Add(new StudyBuddyCase(g.Passage, ex.Usage.Iterations, ex.Usage.CostUsdTotal, 0, Completed: false));
scores.Add(new JudgeScore(0, 0, 0, "budget exhausted"));
totalSteps += ex.Usage.Iterations;
totalCost += ex.Usage.CostUsdTotal;
continue;
}

var evidence =
$"Passage:\n{g.Passage}\n\nReference explanation:\n{g.RubricAnswer}\n\nAgent's explanation:\n{answer}";
var evaluator = new RubricEvaluator(Feature, Rubric);
var evaluation = await evaluator.EvaluateAsync(
JudgePlaceholderMessages, new ChatResponse(new ChatMessage(ChatRole.Assistant, answer)),
chatConfig, [new RubricEvidenceContext(evidence)], ct);

var score = new JudgeScore(
ReadAxis(evaluation, Rubric.Dim1), ReadAxis(evaluation, Rubric.Dim2), ReadAxis(evaluation, Rubric.Dim3), string.Empty);
scores.Add(score);
cases.Add(new StudyBuddyCase(g.Passage, steps, cost, score.Mean, Completed: true));
totalSteps += steps;
totalCost += cost;
}

var n = cases.Count;
var judgeMean = scores.Count > 0 ? JudgeRunner.Aggregate(scores).MeanOverall : 0;
var avgSteps = n > 0 ? (double)totalSteps / n : 0;
var avgCost = n > 0 ? totalCost / n : 0m;

logger.LogInformation(
"Study Buddy eval edition={Edition} judge={Judge:0.00} avgSteps={Steps:0.0} avgCost=${Cost} (N={N})",
editionId, judgeMean, avgSteps, avgCost, n);

if (persist && db is not null)
{
db.EvalRuns.Add(new EvalRun
{
Id = Guid.NewGuid(),
Feature = Feature,
ModelId = "agent",
JudgeModelId = judgeModelId,
Score = Math.Round((decimal)judgeMean, 3),
N = n,
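// Format with the invariant culture so decimals never render with a comma, which would be invalid JSON.
BreakdownJson = FormattableString.Invariant($"{{\"avgSteps\":{avgSteps:0.00},\"avgCostUsd\":{avgCost:0.0000},\"completed\":{cases.Count(c => c.Completed)}}}"),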
GitSha = gitSha,
CreatedAt = DateTimeOffset.UtcNow,
});
await db.SaveChangesAsync(ct);
}

return new StudyBuddyEvalResult(judgeMean, avgSteps, avgCost, n, cases);
}

private static int ReadAxis(EvaluationResult result, string dim) =>
(int)Math.Round(result.Get<NumericMetric>($"{Feature}.{dim.Split(':')[0].Trim()}").Value ?? 0);
}
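
To make the failure accounting concrete: a budget-exhausted run contributes a zero judge score plus its capped steps and cost to the same averages as the completed runs. A worked sketch with made-up numbers, assuming the overall judge mean is the plain mean of the per-case means (the behavior of `JudgeRunner.Aggregate` is not shown in this diff). `Outcome`, `EvalMath.Summarize` and the cap of 8 iterations are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-case outcome: judge mean (0 when the budget was exhausted), steps, cost.
public sealed record Outcome(double JudgeMean, int Steps, decimal CostUsd, bool Completed);

public static class EvalMath
{
    // Averages over ALL cases, so failed runs pull the judge mean down
    // and push the step and cost averages up.
    public static (double JudgeMean, double AvgSteps, decimal AvgCostUsd) Summarize(IReadOnlyList<Outcome> cases)
    {
        if (cases.Count == 0) return (0, 0, 0m);
        return (
            cases.Average(c => c.JudgeMean),
            cases.Average(c => c.Steps),
            cases.Sum(c => c.CostUsd) / cases.Count);
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up data: nine completed runs plus one that hit an assumed cap of 8 iterations.
        var cases = Enumerable.Repeat(new Outcome(5.0, 2, 0.01m, true), 9)
            .Append(new Outcome(0.0, 8, 0.04m, false))
            .ToList();

        var (judge, steps, cost) = EvalMath.Summarize(cases);
        Console.WriteLine($"judge={judge} avgSteps={steps} avgCost={cost}");
    }
}
```

With these numbers a single exhausted run out of ten lowers an otherwise perfect judge mean from 5.0 to 4.5 and raises the average steps from 2.0 to 2.6, which is why counting exhausted runs as scored failures matters for the DoD gate.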
14 changes: 14 additions & 0 deletions backend/src/Ai/TextStack.Ai.EvalSuite/StudyBuddyGolden.cs
@@ -0,0 +1,14 @@
namespace TextStack.Ai.EvalSuite;

/// <summary>
/// One Study Buddy golden case (AI-039): a confusing passage (with its chapter for context) plus a
/// reference explanation the judge scores the agent's answer against. The passages target the eval
/// edition (DDIA); align <see cref="ChapterNumber"/> to that edition's numbering.
/// </summary>
public record StudyBuddyGolden(string Passage, int? ChapterNumber, string RubricAnswer);

/// <summary>Loads the embedded Study Buddy golden set (<c>studybuddy.json</c>).</summary>
public static class StudyBuddyGoldenSet
{
public static IReadOnlyList<StudyBuddyGolden> Load() => GoldenLoader.Load<StudyBuddyGolden>("studybuddy.json");
}
55 changes: 55 additions & 0 deletions backend/src/Api/Endpoints/AdminAiQualityEndpoints.cs
@@ -5,6 +5,7 @@
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;
using TextStack.Ai.Core;
using Application.Agents;
using TextStack.Ai.EvalSuite;
using TextStack.Ai.Tools;

@@ -28,6 +29,60 @@ public static void MapAdminAiQualityEndpoints(this WebApplication app)
group.MapPost("/evals/run", RunEvals);
group.MapGet("/evals/status", GetEvalStatus);
group.MapPost("/evals/toolcalls/run", RunToolCallEval);
group.MapPost("/evals/studybuddy/run", RunStudyBuddyEval);
}

// Phase 6 DoD gate (AI-039): runs the Study Buddy agent over the golden passages against a real
// edition and scores the answers + records steps/cost. Needs an embedded edition (DDIA) + a key.
private static async Task<IResult> RunStudyBuddyEval(
[FromQuery] Guid editionId,
[FromQuery] string? judge,
HttpContext httpContext,
IServiceProvider services,
IConfiguration config,
StudyBuddyEvalRunner runner,
StudyBuddyAgent agent,
IAppDbContext db,
CancellationToken ct)
{
if (editionId == Guid.Empty)
return Results.BadRequest(new { error = "editionId query parameter is required." });

var useOllama = string.Equals(judge, "ollama", StringComparison.OrdinalIgnoreCase);
var judgeKey = useOllama ? "ollama" : "openai-judge";
var judgeModelId = useOllama ? (config["Ollama:Model"] ?? "gemma4:e4b") : (config["Eval:JudgeModel"] ?? "gpt-4.1");

ILlmService judgeClient;
try
{
judgeClient = services.GetRequiredKeyedService<ILlmService>(judgeKey);
}
catch (InvalidOperationException)
{
return Results.Problem("Judge LLM is not configured.", statusCode: 503);
}

var gitSha = Environment.GetEnvironmentVariable("GIT_SHA");
// The agent's tools resolve scoped services (db, retrieval) from the request scope.
var result = await runner.RunAsync(
agent, judgeClient, judgeModelId, editionId, userId: null, httpContext.RequestServices,
persist: true, db, gitSha, ct);

return Results.Ok(new
{
judgeScore = Math.Round(result.JudgeScore, 3),
avgSteps = Math.Round(result.AvgSteps, 2),
avgCostUsd = result.AvgCostUsd,
n = result.N,
cases = result.Cases.Select(c => new
{
passage = c.Passage.Length > 80 ? c.Passage[..80] + "…" : c.Passage,
c.Steps,
c.CostUsd,
c.JudgeScore,
c.Completed,
}),
});
}

// Phase 5 DoD gate (AI-033): deterministic tool-call accuracy over the embedded golden set.
1 change: 1 addition & 0 deletions backend/src/Api/Program.cs
@@ -83,6 +83,7 @@
builder.Services.AddSingleton<TextStack.Ai.EvalSuite.EvalSuiteRunner>();
builder.Services.AddSingleton<TextStack.Ai.EvalSuite.RagEvalRunner>();
builder.Services.AddSingleton<TextStack.Ai.EvalSuite.ToolCallEvalRunner>();
builder.Services.AddSingleton<TextStack.Ai.EvalSuite.StudyBuddyEvalRunner>();
// Tool catalogue (AI-029/030): scans Application for ITool impls; dispatch is schema-validated.
builder.Services.AddAiTools(typeof(Application.Tools.GetChapterTool).Assembly);
// Agent loop engine (Phase 6, AI-034). Concrete agents (StudyBuddy, AI-035) build on it.
63 changes: 63 additions & 0 deletions tests/TextStack.AiEvals/StudyBuddyEvalRunnerTests.cs
@@ -0,0 +1,63 @@
using Application.Agents;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging.Abstractions;
using TextStack.Ai.Agents;
using TextStack.Ai.Core;
using TextStack.Ai.EvalSuite;
using TextStack.Ai.Tools;

namespace TextStack.AiEvals;

/// <summary>
/// Deterministic coverage for <see cref="StudyBuddyEvalRunner"/> (AI-039): the agent runs on a fake
/// LLM that answers directly (no tools, one iteration), the judge is fixed — so the run scores the
/// golden set, aggregates the judge mean, and computes avg steps + cost without a key or corpus.
/// </summary>
public class StudyBuddyEvalRunnerTests
{
private static readonly int GoldenN = StudyBuddyGoldenSet.Load().Count;

/// <summary>Answers every prompt directly (no tool calls) with a fixed cost — one iteration per run.</summary>
private sealed class DirectLlm : ILlmService
{
public Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken ct) =>
Task.FromResult(new LlmResponse("A grounded explanation of the passage.", [], new LlmUsage(40, 20, 0.003m), "agent-model", Guid.NewGuid()));

public IAsyncEnumerable<LlmDelta> StreamAsync(LlmRequest request, CancellationToken ct) =>
throw new NotSupportedException();
}

private sealed class FixedJudge(int d1, int d2, int d3) : ILlmService
{
public Task<LlmResponse> CompleteAsync(LlmRequest request, CancellationToken ct) =>
Task.FromResult(new LlmResponse(
$"{{\"d1\": {d1}, \"d2\": {d2}, \"d3\": {d3}, \"rationale\": \"ok\"}}",
[], new LlmUsage(0, 0, 0m), "judge", Guid.NewGuid()));

public IAsyncEnumerable<LlmDelta> StreamAsync(LlmRequest request, CancellationToken ct) =>
throw new NotSupportedException();
}

private static StudyBuddyAgent Agent(ILlmService llm)
{
var registry = new ToolRegistry([]);
return new StudyBuddyAgent(new AgentLoop(llm, registry, new ToolDispatcher(registry)));
}

[Fact]
public async Task RunAsync_DirectAgent_ScoresGoldensAndAggregates()
{
var runner = new StudyBuddyEvalRunner(NullLogger<StudyBuddyEvalRunner>.Instance);

var result = await runner.RunAsync(
Agent(new DirectLlm()), new FixedJudge(5, 4, 5), "judge-test",
Guid.NewGuid(), userId: null, new ServiceCollection().BuildServiceProvider(),
persist: false, db: null, gitSha: null, TestContext.Current.CancellationToken);

Assert.Equal(GoldenN, result.N);
Assert.Equal((5 + 4 + 5) / 3.0, result.JudgeScore, 3); // every case scored 4.67
Assert.Equal(1.0, result.AvgSteps, 3); // direct answer → one iteration each
Assert.Equal(0.003m, result.AvgCostUsd); // fixed per-run cost
Assert.All(result.Cases, c => Assert.True(c.Completed));
}
}