feat(evals): finish Phase 1 — Ring 1 complete, Layers 0-6 wired, --no-telemetry#670
feat(evals): finish Phase 1 — Ring 1 complete, Layers 0-6 wired, --no-telemetry#670kaiapeacock-eng wants to merge 7 commits into
Conversation
Pulls in the open scenario PRs (rr7-data, react-vite) and adds the
remaining three Ring 1 scenarios per docs/evals.md:
* nextjs-app-router/pre-existing-vendor — exercises augment-don't-replace
with a stale @amplitude/analytics-browser install, golden shows clean
migration to @amplitude/unified
* expo/vanilla — Expo Router app, EXPO_PUBLIC_ env prefix, Expo SDK
web export as the build command (no Xcode/Android tooling needed)
* generic/probe — stripped vanilla HTML/JS app with framework markers
removed; useDetection=false so we grade the wizard's detection
pipeline against `generic` as ground truth
Plus supporting work:
* Layer 0 correct-sdk-package now hard-fails on stale-legacy-SDK
coexistence (failure mode #10: both @amplitude/unified and
@amplitude/analytics-browser surviving a pre-existing-vendor run)
* Framework→SDK table picks up `react-router` and `expo` aliases
* Layer 1 idempotent-rerun scorer (criterion 16) — opt-in via
Artifact.secondRunLog; skip-passes with weight 0 when absent
* Runner test parameterized over all seven Ring 1 goldens
* PR-gate matrix lists every Ring 1 scenario
Verified: pnpm exec vitest run evals/ — 94/94 across 13 files.
* Register L1-idempotent-rerun (criterion 16) in score.ts. Replays
that pin a `golden/run-second.ndjson` get full coverage; absence
skip-passes with weight 0.
* Add L1-self-verification-passes (criterion 17). Reads the
`verification_result` events from the run log and asserts every
phase succeeded. Maps failure reasons into the scorer's detail so
triagers see the offending phase without opening the artifact.
* Plumb optional secondRunLog through replay loading.
* Unit tests for both scorers (10 new tests, 100 total).
Verified: pnpm exec vitest run evals/ — 100/100 across 14 files.
Layer 4 boots the post-wizard working tree in a headless browser,
loads a known route, and grades:
* pageStatusCode (2xx/3xx required)
* uncaught console errors (any → fail with the first message)
* outbound Amplitude requests (≥1 required — proves the SDK actually
ran instead of being a tree-shaken no-op import)
Implementation notes:
* `evals/runner/runtime.ts` — runRuntimeProbe() spawns the dev
server, waits for the port, dynamic-imports playwright, navigates,
and intercepts Amplitude requests so they never forward.
* Dynamic playwright import means the framework compiles + tests
cleanly without the dependency. Nightly runners install playwright
explicitly; PR-gate runs skip-pass with a clear "playwright not
installed" detail.
* `runtimeProbe` field added to scenario.json schema (devCommand,
route, port, optional bootTimeoutMs).
* `--runtime` flag on bin/run-eval.ts opts in.
* golden/runtime-result.json pins recorded outcomes for replay
coverage.
Verified: pnpm exec vitest run evals/ — 110/110 across 16 files.
Layer 6 hands a sealed view of the post-wizard diff + setup_complete
+ confirmed events to a Claude Sonnet judge prompted with the 19-point
rubric (versioned via evals/rubrics/rubric-version.txt). The judge
returns one structured-JSON verdict per criterion it grades, each
required to carry an evidence_path + line — uncited verdicts are
treated as flakes and dropped.
Implementation:
* evals/rubrics/judge-prompt.md — canonical system prompt; spells
out the output JSON contract, what to grade (criteria 7, 15, 19
plus free-form taste), and what to leave to deterministic layers.
* evals/rubrics/rubric-version.txt — bumped on every rubric change
so score drift correlates with rubric changes.
* evals/runner/judge.ts — runJudge() loads the rubric, builds the
user message (scenario + diff + setup_complete + events), calls
@anthropic-ai/sdk, validates the response against Zod schemas,
returns a JudgeResult.
* evals/scorers/layer6-judge/judge-verdict.ts — aggregates verdicts
into one ScorerResult; weight is the sum of per-criterion weights;
failure detail names the offending criteria with citations.
* --judge flag on bin/run-eval.ts opts in.
* Optional golden/judge-result.json for replay coverage.
Auth surface: ANTHROPIC_API_KEY today; gateway routing via the
wizard's own OAuth path is the next step.
Verified: 122/122 tests across 18 files; lint clean.
Adds the spec's nightly CI gate plus the harness that catches
non-deterministic scenarios across seeds.
* .github/workflows/evals-nightly.yml — daily 07:00 UTC cron + manual
dispatch, full Ring 1+2 matrix × 2 seeds, installs playwright for
the Layer 4 probe, runs --judge for Layer 6, uploads per-seed
reports plus a variance summary.
* evals/bin/variance-summary.ts — scans all report.json files,
groups by scenario, computes seed-to-seed score spread, flags any
scenario with spread > 10 (the spec's non-determinism threshold)
and exits non-zero so CI surfaces the issue.
* --seed flag on evals:run plus EVAL_SEED env var passthrough so the
nightly matrix can pin a per-job seed onto the artifact and the
variance summary can stitch reports together.
* Unit test for the variance summary (passes within threshold,
flags + exits non-zero when spread > 10).
Verified: 124/124 tests across 19 files; lint clean.
Spell out the layer status, document Layer 5 as the only deferred piece (blocked on the eval-only Amplitude project — open decision #2), and refresh the quick-start commands now that --runtime / --judge / --seed are wired. Verified: pnpm test — 3573/3573 across 245 test files; lint clean.
Adds a hard kill-switch on the wizard's internal analytics so eval synthetic runs never reach the prod analytics project. With this in place, the eval suite no longer needs an "eval-only Amplitude project" (decision #2 in the spec) — synthetic runs are silent. * bin.ts — new --no-telemetry flag, also honored via AMPLITUDE_WIZARD_NO_TELEMETRY=1. * src/utils/analytics.ts — Analytics.disable() / isDisabled(). capture() and identifyUser() short-circuit before the SDK is touched (no init, no track, no network). Idempotent. * src/commands/default.ts — applies disable() before any other code path so the very first event ('session started') short-circuits inside the singleton. * evals/runner/invoke-wizard.ts — runLive() forwards both the flag and the env var to the wizard child, belt-and-braces. Layer 5 (ingestion verification) is now formally out of scope for the eval suite — owned by e2e-tests/. The README + spec document the boundary: Layer 4 covers "did the SDK fire ≥1 outbound request"; the ingestion question (does the request reach Amplitude under real network/hosting/bundling shape) belongs to e2e tests, not regression evals on prompt/skill changes. Test coverage: 4 new tests on Analytics.disable() + isDisabled() (track/identify short-circuit when disabled). Total 3577/3577 across 245 test files; lint clean.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: Nightly workflow artifact names contain invalid slash characters
- Added a sanitization step to replace forward slashes with hyphens in scenario names before using them in artifact names.
- ✅ Fixed: Runtime probe reports ok=true for 5xx status codes
- Changed the ok flag condition to require status codes in the 2xx-3xx range (>= 200 && < 400) instead of just > 0.
Or push these changes by commenting:
@cursor push 7666572eeb
Preview (7666572eeb)
diff --git a/.github/workflows/evals-nightly.yml b/.github/workflows/evals-nightly.yml
--- a/.github/workflows/evals-nightly.yml
+++ b/.github/workflows/evals-nightly.yml
@@ -81,11 +81,15 @@
EVAL_SEED: ${{ matrix.seed }}
run: pnpm evals:run ${{ matrix.scenario }} --runtime --judge
+ - name: Sanitize scenario name
+ id: sanitize
+ run: echo "name=$(echo '${{ matrix.scenario }}' | tr '/' '-')" >> $GITHUB_OUTPUT
+
- name: Upload report
if: always()
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
with:
- name: eval-report-${{ matrix.scenario }}-seed${{ matrix.seed }}
+ name: eval-report-${{ steps.sanitize.outputs.name }}-seed${{ matrix.seed }}
path: evals/reports/**/report.json
retention-days: 30
if-no-files-found: ignore
diff --git a/evals/runner/runtime.ts b/evals/runner/runtime.ts
--- a/evals/runner/runtime.ts
+++ b/evals/runner/runtime.ts
@@ -253,7 +253,10 @@
waitUntil: 'networkidle',
});
pageStatusCode = response?.status() ?? 0;
- ok = consoleErrors.length === 0 && pageStatusCode > 0;
+ ok =
+ consoleErrors.length === 0 &&
+ pageStatusCode >= 200 &&
+ pageStatusCode < 400;
} finally {
await browser.close();
}You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit 7734647. Configure here.
| if: always() | ||
| uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4 | ||
| with: | ||
| name: eval-report-${{ matrix.scenario }}-seed${{ matrix.seed }} |
There was a problem hiding this comment.
Nightly workflow artifact names contain invalid slash characters
High Severity
The artifact name eval-report-${{ matrix.scenario }}-seed${{ matrix.seed }} will contain forward slashes from scenario names like nextjs-app-router/vanilla, producing names like eval-report-nextjs-app-router/vanilla-seed1. GitHub Actions upload-artifact v4 rejects artifact names containing / characters, causing every upload step in the nightly matrix to fail. The download-artifact step in the variance-summary job would then find no reports.
Reviewed by Cursor Bugbot for commit 7734647. Configure here.
| waitUntil: 'networkidle', | ||
| }); | ||
| pageStatusCode = response?.status() ?? 0; | ||
| ok = consoleErrors.length === 0 && pageStatusCode > 0; |
There was a problem hiding this comment.
Runtime probe reports ok=true for 5xx status codes
Low Severity
The ok flag is computed as consoleErrors.length === 0 && pageStatusCode > 0, meaning a 500 response with no console errors sets ok: true. The RuntimeResult.ok docstring says "True when the probe booted, navigated, and exited cleanly," but a 5xx is not a clean exit. The L4 scorer happens to catch this via a separate status-code check, but the ok field on pinned golden/runtime-result.json files could mislead future scorers or tooling that trusts it at face value.
Reviewed by Cursor Bugbot for commit 7734647. Configure here.
kelsonpw
left a comment
There was a problem hiding this comment.
Solid Phase-1 landing — Ring 1 scenarios, scorers, judge prompt, and the variance harness all hang together cleanly. Every Ring 1 CI check is green at HEAD (incl. all 7 scenarios, scorer vitest, build, lint), and the test coverage on the new scorers/runner is appropriate for a contained eval subsystem.
One blocker before merge:
.github/workflows/evals-nightly.yml:88 — the Bugbot HIGH finding is real and unaddressed at HEAD (7734647). Artifact name eval-report-${{ matrix.scenario }}-seed${{ matrix.seed }} interpolates scenario IDs like nextjs-app-router/vanilla, and actions/upload-artifact@v4 rejects names containing /. Every nightly upload step will fail and the download-artifact consumer in variance-summary will get nothing. Bugbot's autofix (sanitize step that tr '/' '-') is the right shape — pull it in or apply equivalent. Same fix should be considered for the artifact name in any future workflow that interpolates scenario paths.
Non-blocking notes:
evals/runner/runtime.ts:256—ok = consoleErrors.length === 0 && pageStatusCode > 0returns true for 5xx pages with no console errors. The L4 scorer catches it via a separate status check, but the field's docstring promises a "clean exit." Worth tightening topageStatusCode >= 200 && pageStatusCode < 400so pinnedgolden/runtime-result.jsonfiles reflect reality. Could be follow-up.- 109 files / ~3.7k LOC is large but the bulk is fixtures and goldens, which is what you want for a scenario suite. Reviewable.
Re-request review after the artifact-name fix lands and I'll approve.
kelsonpw
left a comment
There was a problem hiding this comment.
Heads up — this is ~3.7k LOC across 105 files in one PR. That's a lot to review with confidence in a single pass, and I'd hate for substantive issues to slip past because reviewers run out of steam.
Looking at the commit list, the natural seams are already pretty clean:
- Ring 1 scenarios + Layer 0 stale-legacy hard-fail (commit
cf8554dc) — fixtures + goldens + one scorer tweak - Layer 1 idempotent-rerun + self-verification scorers (commit
c2328b20) - Layer 4 runtime probe (commit
6359cae5) — Playwright dynamic-import + scenario schema bump - Layer 6 LLM judge (commit
2301dd4b) — rubric + Anthropic SDK call + judge scorer - Nightly workflow + variance harness (commit
877c2586) --no-telemetryflag + Layer 5 deferral (commit7734647c) — touchesbin.ts,analytics.ts,commands/default.ts(orthogonal to evals)
Would it be possible to land this as ~3 sequential PRs? E.g.:
- #670a — Ring 1 scenarios + Layer 1 scorers (fixtures + deterministic scoring, no new external deps)
- #670b — Layer 4 + Layer 6 + nightly workflow (Playwright + judge + variance harness — the new external surfaces)
- #670c —
--no-telemetry(the only commit that touches wizard runtime code outsideevals/; reviewable independently by the analytics owners)
Not blocking — happy to power through if it has to ship as one unit, especially given the spec calls Phase 1 a single milestone — but smaller PRs would land faster and get higher-confidence reviews. Calling it out.



Summary
Finishes Phase 1 of the SDK-integration eval suite per
docs/evals.md. The PR-gate ships value today; the nightly + variance harness lights up on the next 07:00 UTC cron; Layer 5 (ingestion) is now formally out of scope and owned bye2e-tests/.7 commits, 105 files changed, ~3.7k LOC added.
What's in this PR
Ring 1 scenarios — 7 of 7 shipped
nextjs-app-router/vanilla(was on main)nextjs-app-router/pre-existing-vendor— exercises augment-don't-replace; Layer 0 now hard-fails on stale-legacy-SDK coexistence (failure mode AMP-150672 Point local dev at Langley wizard proxy #10)react-router-7/framework(was on main)react-router-7/data— pulled from PR feat(evals): Ring 1 scenario — react-router-7/data #575react-vite/vanilla— pulled from PR feat(evals): Ring 1 scenario — react-vite/vanilla #576expo/vanilla— Expo Router, EXPO_PUBLIC_ env prefix,expo export --platform webbuildgeneric/probe— stripped vanilla HTML/JS,useDetection: falseso we grade the wizard's own detection pipelineEach scenario has a lockfile-pinned pristine fixture, a hand-authored golden, and lands a passing 90/90 score on the existing scorer stack.
New scorers
Artifact.secondRunLog; replays can pingolden/run-second.ndjsonverification_resultevents, fails if any phase reportssuccess=falseLayer 6 LLM judge
evals/rubrics/judge-prompt.md— canonical system promptevals/rubrics/rubric-version.txt— bumped on every rubric change so score drift correlates with rubric changesevals/runner/judge.ts— calls Anthropic via@anthropic-ai/sdk, validates response with Zod, drops uncited verdicts as flakesevidence_path+evidence_line_startCI workflows
.github/workflows/evals-pr-scenarios.yml— Ring 1 matrix gets all 7 scenarios (was 1).github/workflows/evals-nightly.yml— daily 07:00 UTC + manual dispatch, full Ring 1+2 × 2 seeds, installs Playwright for Layer 4, runs--judgefor Layer 6, uploads per-seed reports + a separate variance-summary artifactVariance harness
evals/bin/variance-summary.ts— scans everyreport.jsonunder the artifact-download root, groups by scenario, computes seed-to-seed score spread, flags any scenario > 10 points (the spec's non-determinism threshold) and exits non-zero--no-telemetry (resolves spec decision #2)
bin.tsflag +AMPLITUDE_WIZARD_NO_TELEMETRY=1env varAnalytics.disable()/isDisabled()— hard switch that short-circuits bothcapture()andidentifyUser()before the SDK touches anythingLayer 5 — formally out of scope
End-to-end ingestion verification is now owned by
e2e-tests/test-applications/, not the eval suite. Layer 4 covers "did the SDK fire ≥1 outbound request"; whether that request reaches Amplitude under real network/hosting/bundling shape belongs to e2e tests, not regression evals on prompt/skill changes. README + spec document the boundary.Value the PR delivers today
The PR gate runs Layers 0–3 against 7 Ring 1 goldens, wall-clock <8 min, and catches:
Hard fails
init()calls in the same projectnext.config.*/vite.config.*/ etc.)Heavy fails
track()calllib/amplitude.tsre-export wrapperMedium fails
setup_completeevent lying about which files were writtenWarns
The 3 negative-control tests (added in #577) are the existence proof — they introduce these regressions into a passing scenario and confirm the right scorer at the right layer fires.
Coverage scope: the gate fires on PRs touching
commandments.ts,agent-runner.ts,agent-interface.ts,wizard-tools.ts,agent-events.ts,exit-codes.ts,src/frameworks/**,src/ui/agent-ui.ts,skills/integration/**, orevals/**. So every prompt change, skill change, framework config change, or wire-format change gets graded against 7 reference integrations before it can land.Required follow-up: wizard-side OAuth wiring
This PR does not enable live runs. The eval runner already forwards
WIZARD_OAUTH_TOKEN/WIZARD_EXPIRES_AT/WIZARD_ZONEto the wizard child process, but the wizard itself doesn't read those env vars yet. Until that wiring lands, the only options for live runs are:EVALS_ALLOW_API_KEY_BYPASS=1+--api-key— opt-in fallback. Routes LLM calls direct-to-Anthropic, skipping the Amplitude LLM gateway. This cannot catch gateway-specific bugs (Vertex schema noise, beta-header rejections, the proxy 400 class). The runner prints a warning when this mode is used.WIZARD_OAUTH_TOKENis read on the wizard side, gateway-routed live evals work in CI without the bypass.The follow-up PR needs to:
WIZARD_OAUTH_TOKENat wizard startup as an alternative to the OAuth browser flowWIZARD_EXPIRES_AT(skip refresh if still valid; refresh otherwise)WIZARD_ZONEfor region routingUntil that lands:
ANTHROPIC_API_KEYdirectly (not gateway-routed yet — that's a separate but related follow-up)Other follow-ups (unblocked, not gating)
@anthropic-ai/sdkdirectly. Routing through the wizard's LLM gateway would share the same auth + rate-limit story.single-init-call.tsuses regex; can false-positive on commented-out callsites. AST move bumps it from L0 to L2.Integrationenum has 18 entries andskills/integration/has ~32. The skill/detector drift sidebar indocs/evals.mdlays out the gap.Test plan
pnpm exec vitest run evals/— 124/124 across 19 filespnpm test(full suite) — 3577/3577 across 245 filespnpm lint— clean (1 pre-existing warning, unrelated)pnpm evals:run nextjs-app-router/vanilla— 90/90, contractOkpnpm evals:run nextjs-app-router/pre-existing-vendor— 90/90, contractOkpnpm evals:run react-router-7/framework— 90/90, contractOkpnpm evals:run react-router-7/data— 90/90, contractOkpnpm evals:run react-vite/vanilla— 90/90, contractOkpnpm evals:run expo/vanilla— 90/90, contractOkpnpm evals:run generic/probe— 90/90, contractOkKnown caveats
analytics.tslines 20–21. Those are pre-existing public Amplitude write keys (mirror Lightning's ampli config — designed to ship in source like a Stripe publishable key). Not changes introduced by this PR. Suppressing the alert with// nosemgrepon those lines is the right follow-up but out of scope here.🤖 Generated with Claude Code
Note
Medium Risk
Adds new CI workflows plus optional Playwright runtime probing and Anthropic judge calls; misconfiguration could increase CI flakiness/cost or change telemetry behavior via the new global disable switch.
Overview
Completes Phase 1 of the eval suite by expanding Ring 1 coverage to 7 scenarios and wiring additional scoring layers, including optional Layer 4 runtime probing (Playwright headless dev-server boot + intercepting Amplitude requests) and optional Layer 6 LLM judging (Anthropic call with versioned rubric + schema-validated structured verdicts).
Introduces a nightly scheduled workflow that runs the full scenario matrix across two seeds with
--runtime/--judge, uploads per-run reports, and produces a variance summary artifact that flags scenarios whose seed-to-seed score spread exceeds 10 (and exits non-zero when flagged). PR gating is updated to run the full Ring 1 matrix.Adds a
--no-telemetryCLI flag (andAMPLITUDE_WIZARD_NO_TELEMETRY=1support) that hard-disables internal analytics viaanalytics.disable(), and the eval runner now forwards this to wizard child runs to prevent synthetic eval traffic from reaching production telemetry.Reviewed by Cursor Bugbot for commit 7734647. Bugbot is set up for automated code reviews on this repo. Configure here.