Commit db1b2a8
test: evals per-case QA tests and inlined suites (#16424)
# Overview
Refactors the eval suite to surface per-question pass/fail in vitest
output and reduces boilerplate by inlining suite registration into the
spec files.
## Key Changes
- **Per-question test reporting**
  - QA datasets previously batched all questions into a single `it()` that
    asserted on aggregate accuracy. Each question now registers as its own
    `it()` block inside `describe.concurrent`, so vitest reports pass/fail
    per question and `-t` filtering targets individual cases.
  - Extracted `runQACase` from `runDataset` for per-case invocation.
    `runDataset` still exists as a wrapper that maps over cases and
    aggregates accuracy for the snapshot output.
- **Inlined suite registration**
  - The 10 `register*Suite()` functions in `suites/` were thin wrappers
    around two patterns. Replaced with `registerQACases` and
    `registerCodegenCases` helpers in `suites/helpers.ts`; each spec file
    calls these directly with its dataset and label.
  - Deleted the per-suite files. The `suites/` directory now contains only
    `helpers.ts` and `types.ts`.
  - Dropped redundant nested wrappers in graphql/local-api/rest-api specs
    (single-child `describe('Collections')` / `describe('CRUD')`).
- **API key check in globalSetup**
  - The duplicated `beforeAll` API key guard from 10 spec files moved into
    `globalSetup.setup()`. Spec files no longer import `beforeAll` or repeat
    the check.
- **Richer failure output**
  - `failureMessage` and `caseFailureMessage` now include the model answer
    and scorer reasoning inline in the vitest assertion message.
  - Switched all `assert(condition, message)` calls in suites to
    `expect(value, message).toBe(...)` for native vitest reporter
    integration.
- **Path alias support in fixture type-checking**
  - Added `fixtures/ambient.d.ts` with wildcard module declarations for
    `@/*` and `~/*`. The per-invocation tsconfig in `validate.ts` includes
    this file so LLM-generated configs that import from common path aliases
    type-check structurally without requiring real stub files per fixture.
- **Removed unused `ACCURACY_THRESHOLD`**
  - The dataset-level accuracy gate is no longer applied since each case
    asserts independently. `SCORE_THRESHOLD` (per-case pass criterion for
    the LLM judge) remains.
- **Cleanup**
  - Removed redundant `dotenv.config()` from `runner/systemPrompts.ts`
    (vitest setup already loads `.env`).
  - Switched bare `fs`/`path` imports to `node:` prefixed.
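
The split between `runQACase` and `runDataset`, and the richer failure message, could look roughly like the sketch below. It is illustrative only: the type shapes, the synchronous `Answerer` and `Scorer` stand-ins, and the threshold value of `0.7` are assumptions, since the commit message does not show the real signatures (which presumably call a model asynchronously).

```typescript
// Hypothetical shapes; the real types in the eval suite are not shown in this commit message.
type QACase = { question: string; expected: string }
type QAResult = {
  question: string
  answer: string
  score: number
  reasoning: string
  pass: boolean
}

// Assumed value. The commit only says SCORE_THRESHOLD is the per-case pass criterion.
const SCORE_THRESHOLD = 0.7

// Stand-ins for the model call and the LLM judge, injected so the sketch runs offline.
type Answerer = (question: string) => string
type Scorer = (testCase: QACase, answer: string) => { score: number; reasoning: string }

// Runs a single case: ask, score, and record whether it met the threshold.
function runQACase(testCase: QACase, answer: Answerer, score: Scorer): QAResult {
  const modelAnswer = answer(testCase.question)
  const { score: value, reasoning } = score(testCase, modelAnswer)
  return {
    question: testCase.question,
    answer: modelAnswer,
    score: value,
    reasoning,
    pass: value >= SCORE_THRESHOLD,
  }
}

// Wrapper that maps over cases and aggregates accuracy for reporting.
function runDataset(
  cases: QACase[],
  answer: Answerer,
  score: Scorer,
): { results: QAResult[]; accuracy: number } {
  const results = cases.map((testCase) => runQACase(testCase, answer, score))
  const passed = results.filter((result) => result.pass).length
  return { results, accuracy: cases.length === 0 ? 0 : passed / cases.length }
}

// Builds a message that carries the model answer and scorer reasoning inline.
function caseFailureMessage(result: QAResult): string {
  return [
    `Question: ${result.question}`,
    `Model answer: ${result.answer}`,
    `Score: ${result.score} (threshold ${SCORE_THRESHOLD})`,
    `Scorer reasoning: ${result.reasoning}`,
  ].join('\n')
}
```

In a spec, the per-case assertion would then take a form like `expect(result.pass, caseFailureMessage(result)).toBe(true)`, matching the `expect(value, message).toBe(...)` pattern described above.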
## Design Decisions
- **Per-case over batched assertions**: The batched `it()` masked
individual failures behind aggregate accuracy. With 30+ questions per
dataset, the visibility tradeoff favored splitting. Each case now
asserts independently and aggregate metrics are computed at
snapshot/reporting time rather than in test code.
- **Inlining over indirection**: With variant selection moved to a
runtime env var (`EVAL_VARIANT` via `resolveVariantOptions()`), each
spec calls register once. The `register*Suite()` function abstraction
added a layer with no remaining purpose. Inlining puts the suite
definition (name, datasets, structure) in one file the reader can scan
top to bottom.
- **`expect` over `assert`**: Both work, but `expect` integrates with
vitest's reporter so the custom failure message is rendered alongside
the standard test output rather than as a raw `AssertionError`.
- **Ambient module declarations over per-fixture stub files**: LLMs
commonly generate imports from `@/components/...` paths. Maintaining
real stub files per import path would not scale. Wildcard module
declarations let any `@/*` import resolve to `any`, which is sufficient
for structural type-checking.
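
As an illustration of the ambient-declaration approach, a file like `fixtures/ambient.d.ts` might contain something along these lines. The actual contents are not shown in this commit message, so treat this as a sketch of the technique (TypeScript shorthand ambient modules with a wildcard), not the file itself.

```typescript
// fixtures/ambient.d.ts (sketch; the real file may differ)
// This file must stay a script (no top-level import/export) so that the
// declarations below are ambient rather than module augmentations.

// Shorthand ambient module declarations: anything imported from a
// matching specifier is typed as `any`.
declare module '@/*'
declare module '~/*'
```

With a file like this included in the tsconfig, an import such as `import { Button } from '@/components/Button'` type-checks without a real file behind it, and `Button` is typed as `any`.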
## Overall Flow
```mermaid
flowchart TB
subgraph Before
A1[spec file]
A1 --> A2[register Suite function]
A2 --> A3[runDataset]
A3 --> A4[Promise.all over cases]
A4 --> A5[Single it asserts aggregate accuracy]
end
subgraph After
B1[spec file]
B1 --> B2[registerQACases helper]
B2 --> B3[describe.concurrent]
B3 --> B4[it per case]
B4 --> B5[runQACase]
B5 --> B6[expect with rich message]
end
```
1 parent cb260b9 commit db1b2a8
32 files changed
Lines changed: 352 additions & 580 deletions
File tree
- test/evals
  - datasets/config
  - fixtures
  - runner
  - suites
  - utils
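
The "After" path in the flow diagram (helper, `describe.concurrent`, one `it` per case) can be sketched as follows. To keep the example runnable without vitest installed, the test API is passed in through a hypothetical `TestApi` parameter; in the real `suites/helpers.ts`, `describe` and `it` are presumably imported from vitest directly. The `QACase` shape, the `id` field, and the `RunCase` callback are also assumptions made for illustration.

```typescript
// Minimal slice of a test API, injected so this sketch runs without vitest.
// In a spec file, vitest's `describe.concurrent` and `it` would fill these roles.
type TestApi = {
  describeConcurrent: (name: string, body: () => void) => void
  it: (name: string, fn: () => void) => void
}

// Hypothetical case shape; `id` is used as the test name here.
type QACase = { id: string; question: string; expected: string }

// Hypothetical stand-in for the per-case runner. It is expected to throw on
// failure, the way a failing `expect(...)` would.
type RunCase = (testCase: QACase) => void

function registerQACases(
  label: string,
  cases: QACase[],
  runCase: RunCase,
  api: TestApi,
): void {
  api.describeConcurrent(label, () => {
    for (const testCase of cases) {
      // One test per question, so the reporter and name filters see each case individually.
      api.it(testCase.id, () => runCase(testCase))
    }
  })
}
```

Each spec file then makes a single call with its dataset and label, which is what lets the suite definition live in one place instead of behind a `register*Suite()` wrapper.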