Commit db1b2a8

test: evals per-case QA tests and inlined suites (#16424)
# Overview

Refactors the eval suite to surface per-question pass/fail in vitest output and reduces boilerplate by inlining suite registration into the spec files.

## Key Changes

- **Per-question test reporting**
  - QA datasets previously batched all questions into a single `it()` that asserted on aggregate accuracy. Each question now registers as its own `it()` block inside `describe.concurrent`, so vitest reports pass/fail per question and `-t` filtering targets individual cases.
  - Extracted `runQACase` from `runDataset` for per-case invocation. `runDataset` still exists as a wrapper that maps over cases and aggregates accuracy for the snapshot output.
- **Inlined suite registration**
  - The 10 `register*Suite()` functions in `suites/` were thin wrappers around two patterns. Replaced with `registerQACases` and `registerCodegenCases` helpers in `suites/helpers.ts`; each spec file calls these directly with its dataset and label.
  - Deleted the per-suite files. The `suites/` directory now contains only `helpers.ts` and `types.ts`.
  - Dropped redundant nested wrappers in graphql/local-api/rest-api specs (single-child `describe('Collections')` / `describe('CRUD')`).
- **API key check in globalSetup**
  - The duplicated `beforeAll` API key guard from 10 spec files moved into `globalSetup.setup()`. Spec files no longer import `beforeAll` or repeat the check.
- **Richer failure output**
  - `failureMessage` and `caseFailureMessage` now include the model answer and scorer reasoning inline in the vitest assertion message.
  - Switched all `assert(condition, message)` calls in suites to `expect(value, message).toBe(...)` for native vitest reporter integration.
- **Path alias support in fixture type-checking**
  - Added `fixtures/ambient.d.ts` with wildcard module declarations for `@/*` and `~/*`.
    The per-invocation tsconfig in `validate.ts` includes this file so LLM-generated configs that import from common path aliases type-check structurally without requiring real stub files per fixture.
- **Removed unused ACCURACY_THRESHOLD**
  - The dataset-level accuracy gate is no longer applied since each case asserts independently. `SCORE_THRESHOLD` (per-case pass criterion for the LLM judge) remains.
- **Cleanup**
  - Removed redundant `dotenv.config()` from `runner/systemPrompts.ts` (vitest setup already loads `.env`).
  - Switched bare `fs`/`path` imports to `node:` prefixed.

## Design Decisions

- **Per-case over batched assertions**: The batched `it()` masked individual failures behind aggregate accuracy. With 30+ questions per dataset, the visibility tradeoff favored splitting. Each case now asserts independently and aggregate metrics are computed at snapshot/reporting time rather than in test code.
- **Inlining over indirection**: With variant selection moved to a runtime env var (`EVAL_VARIANT` via `resolveVariantOptions()`), each spec calls register once. The `register*Suite()` function abstraction added a layer with no remaining purpose. Inlining puts the suite definition (name, datasets, structure) in one file the reader can scan top to bottom.
- **`expect` over `assert`**: Both work, but `expect` integrates with vitest's reporter so the custom failure message is rendered alongside the standard test output rather than as a raw `AssertionError`.
- **Ambient module declarations over per-fixture stub files**: LLMs commonly generate imports from `@/components/...` paths. Maintaining real stub files per import path would not scale. Wildcard module declarations let any `@/*` import resolve to `any`, which is sufficient for structural type-checking.

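
The ambient-declaration approach can be illustrated with TypeScript's shorthand wildcard module declarations. The snippet below is an assumption about what a file like `fixtures/ambient.d.ts` could contain, not a copy of the file added in this commit.

```typescript
// Hypothetical contents of an ambient declaration file.
// A shorthand declaration (no body) gives every import from a matching
// module specifier the type `any`, so no stub file is needed per path.
declare module '@/*'
declare module '~/*'
```

With declarations like these in scope, an LLM-generated `import Logo from '@/components/Logo'` type-checks even though no such file exists in the fixture. The trade-off is that nothing imported through the alias is itself checked.
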
## Overall Flow

```mermaid
flowchart TB
    subgraph Before
        A1[spec file]
        A1 --> A2[register Suite function]
        A2 --> A3[runDataset]
        A3 --> A4[Promise.all over cases]
        A4 --> A5[Single it asserts aggregate accuracy]
    end
    subgraph After
        B1[spec file]
        B1 --> B2[registerQACases helper]
        B2 --> B3[describe.concurrent]
        B3 --> B4[it per case]
        B4 --> B5[runQACase]
        B5 --> B6[expect with rich message]
    end
```
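
The split between `runQACase` and `runDataset` described under Key Changes can be sketched roughly as follows. This is a minimal sketch, not the code from this commit: the `QACase` and `QAResult` shapes, the injected `answer` and `judge` functions, and the `SCORE_THRESHOLD` value of 0.7 are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the real eval types may differ.
type QACase = { input: string; expected: string }
type Judgement = { score: number; reasoning: string }
type QAResult = QACase & Judgement & { answer: string; pass: boolean }

// Model and judge calls are injected so the sketch runs without network access.
type Deps = {
  answer: (question: string) => Promise<string>
  judge: (answer: string, expected: string) => Promise<Judgement>
}

const SCORE_THRESHOLD = 0.7 // assumed value for the per-case pass criterion

// Runs one question end to end; this is what a per-case it() would call.
async function runQACase(testCase: QACase, deps: Deps): Promise<QAResult> {
  const answer = await deps.answer(testCase.input)
  const judgement = await deps.judge(answer, testCase.expected)
  return { ...testCase, ...judgement, answer, pass: judgement.score >= SCORE_THRESHOLD }
}

// Thin wrapper: maps over the cases and aggregates accuracy for reporting.
async function runDataset(cases: QACase[], deps: Deps) {
  const results = await Promise.all(cases.map((c) => runQACase(c, deps)))
  const passed = results.filter((r) => r.pass).length
  return { results, accuracy: cases.length === 0 ? 0 : passed / cases.length }
}
```

Keeping the aggregate in a wrapper means the per-case function stays the single source of truth for what "pass" means, while accuracy remains available for snapshot output.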
1 parent cb260b9 commit db1b2a8

32 files changed

Lines changed: 352 additions & 580 deletions

test/evals/datasets/config/codegen.ts

Lines changed: 0 additions & 8 deletions
@@ -1,14 +1,6 @@
 import type { CodegenEvalCase } from '../../types.js'

 export const configCodegenDataset: CodegenEvalCase[] = [
-  {
-    input:
-      'Add the SEO plugin from "@payloadcms/plugin-seo" to the config. Pass a generateTitle function that returns a string combining the document title and the site name "Acme Corp".',
-    expected:
-      'import seoPlugin from "@payloadcms/plugin-seo", seoPlugin() added to plugins array, generateTitle function returning a string that includes the doc title and "Acme Corp"',
-    category: 'config',
-    fixturePath: 'config/codegen/seo-plugin',
-  },
   {
     input:
       'Configure the admin panel to use a custom Logo component imported from "@/components/Logo" and set the admin meta titleSuffix to " | My CMS".',
Lines changed: 10 additions & 8 deletions
@@ -1,12 +1,14 @@
-import { beforeAll } from 'vitest'
+import { describe } from 'vitest'

-import { registerBuildingPluginsSuite } from './suites/index.js'
+import { pluginsCodegenDataset } from './datasets/plugins/codegen.js'
+import { pluginsQADataset } from './datasets/plugins/qa.js'
+import { registerCodegenCases, registerQACases } from './suites/helpers.js'
 import { resolveVariantOptions } from './variantOptions.js'

-beforeAll(() => {
-  if (!process.env.OPENAI_API_KEY) {
-    throw new Error('OPENAI_API_KEY must be set to run eval tests')
-  }
-})
+const options = resolveVariantOptions()
+const { labelSuffix = '' } = options

-registerBuildingPluginsSuite(resolveVariantOptions())
+describe(`Building Plugins${labelSuffix}`, () => {
+  registerQACases(pluginsQADataset, 'Building Plugins: QA', options)
+  registerCodegenCases(pluginsCodegenDataset, 'Building Plugins: Codegen', options)
+})
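
A helper like `registerQACases`, as called in the spec above, might register one test per question along these lines. This is a hedged sketch rather than the contents of `suites/helpers.ts`: the `TestApi` interface, the `RunCase` callback, the argument order, and the use of the question text as the test name are all assumptions. Injecting `describe` and `it` keeps the sketch runnable outside a vitest process; under vitest one would presumably pass `describe.concurrent` and `it`.

```typescript
// Minimal test-runner surface, injected so the sketch is standalone.
type TestApi = {
  describe: (name: string, body: () => void) => void
  it: (name: string, body: () => Promise<void>) => void
}

type QACase = { input: string; expected: string } // hypothetical shape
type CaseOutcome = { pass: boolean; answer: string; reasoning: string }
type RunCase = (testCase: QACase) => Promise<CaseOutcome>

// Registers one it() per question so the reporter shows pass/fail per case.
function registerQACasesSketch(
  api: TestApi,
  cases: QACase[],
  label: string,
  runCase: RunCase,
): void {
  api.describe(label, () => {
    for (const testCase of cases) {
      api.it(testCase.input, async () => {
        const outcome = await runCase(testCase)
        if (!outcome.pass) {
          // Surface the model answer and scorer reasoning in the failure itself.
          throw new Error(
            `Q: ${testCase.input}\nanswer: ${outcome.answer}\nreasoning: ${outcome.reasoning}`,
          )
        }
      })
    }
  })
}
```

Because each question gets its own test name, a name filter such as vitest's `-t` can target a single case instead of rerunning the whole dataset.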
Lines changed: 10 additions & 8 deletions
@@ -1,12 +1,14 @@
-import { beforeAll } from 'vitest'
+import { describe } from 'vitest'

-import { registerCollectionsSuite } from './suites/index.js'
+import { collectionsCodegenDataset } from './datasets/collections/codegen.js'
+import { collectionsQADataset } from './datasets/collections/qa.js'
+import { registerCodegenCases, registerQACases } from './suites/helpers.js'
 import { resolveVariantOptions } from './variantOptions.js'

-beforeAll(() => {
-  if (!process.env.OPENAI_API_KEY) {
-    throw new Error('OPENAI_API_KEY must be set to run eval tests')
-  }
-})
+const options = resolveVariantOptions()
+const { labelSuffix = '' } = options

-registerCollectionsSuite(resolveVariantOptions())
+describe(`Collections${labelSuffix}`, () => {
+  registerQACases(collectionsQADataset, 'Collections: QA', options)
+  registerCodegenCases(collectionsCodegenDataset, 'Collections: Codegen', options)
+})
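
According to the commit message, the `beforeAll` guard removed from this spec now runs once in `globalSetup.setup()`. A minimal sketch of such a module follows, assuming vitest's documented convention that a global setup file may export a named `setup` function. The `assertApiKey` helper and the `Env` type are hypothetical names, introduced so the check can be exercised without touching the real environment.

```typescript
type Env = Record<string, string | undefined>

// Hypothetical helper: throws when the key is missing or empty.
function assertApiKey(env: Env): void {
  if (!env.OPENAI_API_KEY) {
    throw new Error('OPENAI_API_KEY must be set to run eval tests')
  }
}

// Intended to run once per test run, so individual spec files no longer
// need their own beforeAll guard.
export function setup(): void {
  assertApiKey(process.env)
}
```

Failing in global setup reports the missing key once, rather than once per spec file.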

test/evals/eval.config.spec.ts

Lines changed: 10 additions & 8 deletions
@@ -1,12 +1,14 @@
-import { beforeAll } from 'vitest'
+import { describe } from 'vitest'

-import { registerConfigSuite } from './suites/index.js'
+import { configCodegenDataset } from './datasets/config/codegen.js'
+import { configQADataset } from './datasets/config/qa.js'
+import { registerCodegenCases, registerQACases } from './suites/helpers.js'
 import { resolveVariantOptions } from './variantOptions.js'

-beforeAll(() => {
-  if (!process.env.OPENAI_API_KEY) {
-    throw new Error('OPENAI_API_KEY must be set to run eval tests')
-  }
-})
+const options = resolveVariantOptions()
+const { labelSuffix = '' } = options

-registerConfigSuite(resolveVariantOptions())
+describe(`Config${labelSuffix}`, () => {
+  registerQACases(configQADataset, 'Config: QA', options)
+  registerCodegenCases(configCodegenDataset, 'Config: Codegen', options)
+})
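
Each spec now calls `resolveVariantOptions()` once and threads the result through the helpers. The commit message says variant selection is driven by the `EVAL_VARIANT` environment variable; the lookup below is a hedged sketch of how that could work. The variant names (`baseline`, `no-docs`) and the table itself are hypothetical, the environment is passed in as a parameter purely to keep the sketch testable, and only `labelSuffix` is taken from this diff.

```typescript
type VariantOptions = { labelSuffix?: string }

// Hypothetical variant table; the real variants are not shown in this diff.
const VARIANTS: Record<string, VariantOptions> = {
  baseline: {},
  'no-docs': { labelSuffix: ' [no-docs]' },
}

// Reads EVAL_VARIANT at runtime; an unset variable means default options.
function resolveVariantOptionsSketch(env: Record<string, string | undefined>): VariantOptions {
  const name = env.EVAL_VARIANT
  if (!name) return {}
  const options = VARIANTS[name]
  if (!options) throw new Error(`Unknown EVAL_VARIANT: ${name}`)
  return options
}

// Mirrors the spec files: the suffix defaults to '' so the label is unchanged
// when no variant is selected.
const { labelSuffix = '' } = resolveVariantOptionsSketch({ EVAL_VARIANT: 'no-docs' })
const suiteName = `Config${labelSuffix}` // 'Config [no-docs]'
```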
Lines changed: 8 additions & 8 deletions
@@ -1,12 +1,12 @@
-import { beforeAll } from 'vitest'
+import { describe } from 'vitest'

-import { registerConventionsSuite } from './suites/index.js'
+import { conventionsQADataset } from './datasets/conventions/qa.js'
+import { registerQACases } from './suites/helpers.js'
 import { resolveVariantOptions } from './variantOptions.js'

-beforeAll(() => {
-  if (!process.env.OPENAI_API_KEY) {
-    throw new Error('OPENAI_API_KEY must be set to run eval tests')
-  }
-})
+const options = resolveVariantOptions()
+const { labelSuffix = '' } = options

-registerConventionsSuite(resolveVariantOptions())
+describe(`Conventions${labelSuffix}`, () => {
+  registerQACases(conventionsQADataset, 'Conventions: QA', options)
+})

test/evals/eval.fields.spec.ts

Lines changed: 10 additions & 8 deletions
@@ -1,12 +1,14 @@
-import { beforeAll } from 'vitest'
+import { describe } from 'vitest'

-import { registerFieldsSuite } from './suites/index.js'
+import { fieldsCodegenDataset } from './datasets/fields/codegen.js'
+import { fieldsQADataset } from './datasets/fields/qa.js'
+import { registerCodegenCases, registerQACases } from './suites/helpers.js'
 import { resolveVariantOptions } from './variantOptions.js'

-beforeAll(() => {
-  if (!process.env.OPENAI_API_KEY) {
-    throw new Error('OPENAI_API_KEY must be set to run eval tests')
-  }
-})
+const options = resolveVariantOptions()
+const { labelSuffix = '' } = options

-registerFieldsSuite(resolveVariantOptions())
+describe(`Fields${labelSuffix}`, () => {
+  registerQACases(fieldsQADataset, 'Fields: QA', options)
+  registerCodegenCases(fieldsCodegenDataset, 'Fields: Codegen', options)
+})

test/evals/eval.graphql.spec.ts

Lines changed: 8 additions & 8 deletions
@@ -1,12 +1,12 @@
-import { beforeAll } from 'vitest'
+import { describe } from 'vitest'

-import { registerGraphQLSuite } from './suites/index.js'
+import { graphqlCollectionsQADataset } from './datasets/graphql/collections/qa.js'
+import { registerQACases } from './suites/helpers.js'
 import { resolveVariantOptions } from './variantOptions.js'

-beforeAll(() => {
-  if (!process.env.OPENAI_API_KEY) {
-    throw new Error('OPENAI_API_KEY must be set to run eval tests')
-  }
-})
+const options = resolveVariantOptions()
+const { labelSuffix = '' } = options

-registerGraphQLSuite(resolveVariantOptions())
+describe(`GraphQL${labelSuffix}`, () => {
+  registerQACases(graphqlCollectionsQADataset, 'GraphQL: Collections QA', options)
+})

test/evals/eval.local-api.spec.ts

Lines changed: 8 additions & 8 deletions
@@ -1,12 +1,12 @@
-import { beforeAll } from 'vitest'
+import { describe } from 'vitest'

-import { registerLocalApiSuite } from './suites/index.js'
+import { localApiCollectionsQADataset } from './datasets/local-api/collections/qa.js'
+import { registerQACases } from './suites/helpers.js'
 import { resolveVariantOptions } from './variantOptions.js'

-beforeAll(() => {
-  if (!process.env.OPENAI_API_KEY) {
-    throw new Error('OPENAI_API_KEY must be set to run eval tests')
-  }
-})
+const options = resolveVariantOptions()
+const { labelSuffix = '' } = options

-registerLocalApiSuite(resolveVariantOptions())
+describe(`Local API${labelSuffix}`, () => {
+  registerQACases(localApiCollectionsQADataset, 'Local API: Collections QA', options)
+})

test/evals/eval.negative.spec.ts

Lines changed: 25 additions & 8 deletions
@@ -1,12 +1,29 @@
-import { beforeAll } from 'vitest'
+import { describe } from 'vitest'

-import { registerNegativeSuite } from './suites/index.js'
+import {
+  negativeCorrectionCodegenDataset,
+  negativeInvalidInstructionDataset,
+} from './datasets/negative/codegen.js'
+import { negativeQADataset } from './datasets/negative/qa.js'
+import { registerCodegenCases, registerQACases } from './suites/helpers.js'
 import { resolveVariantOptions } from './variantOptions.js'

-beforeAll(() => {
-  if (!process.env.OPENAI_API_KEY) {
-    throw new Error('OPENAI_API_KEY must be set to run eval tests')
-  }
-})
+const options = resolveVariantOptions()
+const { labelSuffix = '' } = options

-registerNegativeSuite(resolveVariantOptions())
+describe(`Negative Tests${labelSuffix}`, () => {
+  registerQACases(negativeQADataset, 'Negative: Detection', {
+    ...options,
+    systemPromptKeyOverride: 'configReview',
+  })
+  registerCodegenCases(negativeCorrectionCodegenDataset, 'Negative: Correction', {
+    ...options,
+    groupName: 'Correction: Codegen',
+  })
+  registerCodegenCases(negativeInvalidInstructionDataset, 'Negative: Invalid Instruction', {
+    ...options,
+    expectPass: false,
+    groupName: 'Invalid Instruction: Codegen',
+    testNamePrefix: 'should fail: ',
+  })
+})
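
The `expectPass: false` and `testNamePrefix` options used for the invalid-instruction dataset suggest an inverted assertion: the generated code is expected not to pass validation. How `registerCodegenCases` interprets these options is not shown in this diff, so the following is only a sketch of one plausible reading. `CodegenOptions`, `buildTestName`, and `outcomeMatches` are hypothetical names, and the default of `true` for `expectPass` is an assumption.

```typescript
type CodegenOptions = {
  expectPass?: boolean // assumed to default to true
  testNamePrefix?: string
}

// Builds the reported test name, e.g. 'should fail: <instruction>'.
function buildTestName(input: string, options: CodegenOptions): string {
  return `${options.testNamePrefix ?? ''}${input}`
}

// True when the observed validation result matches what the suite expects.
function outcomeMatches(passed: boolean, options: CodegenOptions): boolean {
  const expectPass = options.expectPass ?? true
  return passed === expectPass
}
```

Under this reading, a case in the invalid-instruction group is green when validation fails and red when the model produces code that unexpectedly passes.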
Lines changed: 10 additions & 8 deletions
@@ -1,12 +1,14 @@
-import { beforeAll } from 'vitest'
+import { describe } from 'vitest'

-import { registerOfficialPluginsSuite } from './suites/index.js'
+import { pluginsOfficialCodegenDataset } from './datasets/plugins/official/codegen.js'
+import { pluginsOfficialQADataset } from './datasets/plugins/official/qa.js'
+import { registerCodegenCases, registerQACases } from './suites/helpers.js'
 import { resolveVariantOptions } from './variantOptions.js'

-beforeAll(() => {
-  if (!process.env.OPENAI_API_KEY) {
-    throw new Error('OPENAI_API_KEY must be set to run eval tests')
-  }
-})
+const options = resolveVariantOptions()
+const { labelSuffix = '' } = options

-registerOfficialPluginsSuite(resolveVariantOptions())
+describe(`Official Plugins${labelSuffix}`, () => {
+  registerQACases(pluginsOfficialQADataset, 'Official Plugins: QA', options)
+  registerCodegenCases(pluginsOfficialCodegenDataset, 'Official Plugins: Codegen', options)
+})
