feat(evals): finish Phase 1 — Ring 1 complete, Layers 0-6 wired, --no-telemetry by kaiapeacock-eng · Pull Request #670 · amplitude/wizard

kaiapeacock-eng · 2026-05-08T21:19:08Z

Summary

Finishes Phase 1 of the SDK-integration eval suite per docs/evals.md. The PR-gate ships value today; the nightly + variance harness lights up on the next 07:00 UTC cron; Layer 5 (ingestion) is now formally out of scope and owned by e2e-tests/.

7 commits, 105 files changed, ~3.7k LOC added.

What's in this PR

Ring 1 scenarios — 7 of 7 shipped

nextjs-app-router/vanilla (was on main)
nextjs-app-router/pre-existing-vendor — exercises augment-don't-replace; Layer 0 now hard-fails on stale-legacy-SDK coexistence (failure mode AMP-150672 Point local dev at Langley wizard proxy #10)
react-router-7/framework (was on main)
react-router-7/data — pulled from PR feat(evals): Ring 1 scenario — react-router-7/data #575
react-vite/vanilla — pulled from PR feat(evals): Ring 1 scenario — react-vite/vanilla #576
expo/vanilla — Expo Router, EXPO_PUBLIC_ env prefix, expo export --platform web build
generic/probe — stripped vanilla HTML/JS, useDetection: false so we grade the wizard's own detection pipeline

Each scenario has a lockfile-pinned pristine fixture, a hand-authored golden, and lands a passing 90/90 score on the existing scorer stack.

New scorers

L1-idempotent-rerun (criterion 16) — opt-in via Artifact.secondRunLog; replays can pin golden/run-second.ndjson
L1-self-verification-passes (criterion 17) — reads verification_result events, fails if any phase reports success=false
L4-runtime-probe — Playwright dynamic-import gated. Boots the dev server, navigates a known route, intercepts Amplitude requests (does not forward — Layer 5 is out of scope), grades pageStatusCode + console errors + ≥1 SDK request fired
L6-judge-verdict — aggregates structured verdicts from a Sonnet judge prompted with a versioned rubric

Layer 6 LLM judge

evals/rubrics/judge-prompt.md — canonical system prompt
evals/rubrics/rubric-version.txt — bumped on every rubric change so score drift correlates with rubric changes
evals/runner/judge.ts — calls Anthropic via @anthropic-ai/sdk, validates response with Zod, drops uncited verdicts as flakes
Citations required: every verdict must carry evidence_path + evidence_line_start

CI workflows

.github/workflows/evals-pr-scenarios.yml — Ring 1 matrix gets all 7 scenarios (was 1)
.github/workflows/evals-nightly.yml — daily 07:00 UTC + manual dispatch, full Ring 1+2 × 2 seeds, installs Playwright for Layer 4, runs --judge for Layer 6, uploads per-seed reports + a separate variance-summary artifact

Variance harness

evals/bin/variance-summary.ts — scans every report.json under the artifact-download root, groups by scenario, computes seed-to-seed score spread, flags any scenario > 10 points (the spec's non-determinism threshold) and exits non-zero

--no-telemetry (resolves spec decision #2)

bin.ts flag + AMPLITUDE_WIZARD_NO_TELEMETRY=1 env var
Analytics.disable() / isDisabled() — hard switch that short-circuits both capture() and identifyUser() before the SDK touches anything
Eval runner forwards both flag and env var to wizard child (belt-and-braces)
This resolves the long-open decision Bump minimatch and eslint #2 (eval-only Amplitude project): synthetic eval runs are now silent, so there's nothing to isolate

Layer 5 — formally out of scope

End-to-end ingestion verification is now owned by e2e-tests/test-applications/, not the eval suite. Layer 4 covers "did the SDK fire ≥1 outbound request"; whether that request reaches Amplitude under real network/hosting/bundling shape belongs to e2e tests, not regression evals on prompt/skill changes. README + spec document the boundary.

Value the PR delivers today

The PR gate runs Layers 0–3 against 7 Ring 1 goldens, wall-clock <8 min, and catches:

Hard fails

Hardcoded API key in any agent-written file
Wrong SDK family for the framework
Stale-legacy-SDK coexistence after a pre-existing-vendor run (new in this PR)
Multiple init() calls in the same project
Build-config bridging (next.config.* / vite.config.* / etc.)

Heavy fails

Confirmed event missing a track() call
Project-local lib/amplitude.ts re-export wrapper
Server/client boundary breakage
Init in the wrong entry file for the framework
Build/typecheck broken by the wizard's diff (L3, when scenario pins BuildResult)

Medium fails

Wrong env var prefix
setup_complete event lying about which files were written
Init options without comments
Wrong package version range / surprise vendor packages

Warns

Property-key naming drift off lowercase-with-spaces

The 3 negative-control tests (added in #577) are the existence proof — they introduce these regressions into a passing scenario and confirm the right scorer at the right layer fires.

Coverage scope: the gate fires on PRs touching commandments.ts, agent-runner.ts, agent-interface.ts, wizard-tools.ts, agent-events.ts, exit-codes.ts, src/frameworks/**, src/ui/agent-ui.ts, skills/integration/**, or evals/**. So every prompt change, skill change, framework config change, or wire-format change gets graded against 7 reference integrations before it can land.

Required follow-up: wizard-side OAuth wiring

This PR does not enable live runs. The eval runner already forwards WIZARD_OAUTH_TOKEN / WIZARD_EXPIRES_AT / WIZARD_ZONE to the wizard child process, but the wizard itself doesn't read those env vars yet. Until that wiring lands, the only options for live runs are:

EVALS_ALLOW_API_KEY_BYPASS=1 + --api-key — opt-in fallback. Routes LLM calls direct-to-Anthropic, skipping the Amplitude LLM gateway. This cannot catch gateway-specific bugs (Vertex schema noise, beta-header rejections, the proxy 400 class). The runner prints a warning when this mode is used.
Wait for the wizard-side wiring follow-up. Once WIZARD_OAUTH_TOKEN is read on the wizard side, gateway-routed live evals work in CI without the bypass.

The follow-up PR needs to:

Read WIZARD_OAUTH_TOKEN at wizard startup as an alternative to the OAuth browser flow
Validate WIZARD_EXPIRES_AT (skip refresh if still valid; refresh otherwise)
Honor WIZARD_ZONE for region routing
Add a unit test that confirms the env-var path produces the same auth state as the OAuth path

Until that lands:

PR-gate runs on goldens (deterministic, no auth)
Nightly runs on goldens (deterministic, no auth)
Layer 6 judge reads ANTHROPIC_API_KEY directly (not gateway-routed yet — that's a separate but related follow-up)

Other follow-ups (unblocked, not gating)

Judge gateway routing. L6 currently calls @anthropic-ai/sdk directly. Routing through the wizard's LLM gateway would share the same auth + rate-limit story.
AST-based init counting. Layer 0's single-init-call.ts uses regex; can false-positive on commented-out callsites. AST move bumps it from L0 to L2.
Ring 2 expansion. 7 Ring 1 scenarios shipped; the Integration enum has 18 entries and skills/integration/ has ~32. The skill/detector drift sidebar in docs/evals.md lays out the gap.
Live runs in CI. Once OAuth wiring is in place, swap goldens for live runs on the nightly (PR gate stays on goldens for speed).

Test plan

Known caveats

First nightly is calibration. Layers 4 + 6 are wired and unit-tested but never exercised end-to-end against actual Playwright / actual Anthropic. The first 1–2 weeks of nightly runs are where we'll find out if Playwright reliably boots every framework's dev server in CI, and if Sonnet judge variance stays under 10 points across seeds.
Semgrep flags analytics.ts lines 20–21. Those are pre-existing public Amplitude write keys (mirror Lightning's ampli config — designed to ship in source like a Stripe publishable key). Not changes introduced by this PR. Suppressing the alert with // nosemgrep on those lines is the right follow-up but out of scope here.

🤖 Generated with Claude Code

Note

Medium Risk
Adds new CI workflows plus optional Playwright runtime probing and Anthropic judge calls; misconfiguration could increase CI flakiness/cost or change telemetry behavior via the new global disable switch.

Overview
Completes Phase 1 of the eval suite by expanding Ring 1 coverage to 7 scenarios and wiring additional scoring layers, including optional Layer 4 runtime probing (Playwright headless dev-server boot + intercepting Amplitude requests) and optional Layer 6 LLM judging (Anthropic call with versioned rubric + schema-validated structured verdicts).

Introduces a nightly scheduled workflow that runs the full scenario matrix across two seeds with --runtime/--judge, uploads per-run reports, and produces a variance summary artifact that flags scenarios whose seed-to-seed score spread exceeds 10 (and exits non-zero when flagged). PR gating is updated to run the full Ring 1 matrix.

Adds a --no-telemetry CLI flag (and AMPLITUDE_WIZARD_NO_TELEMETRY=1 support) that hard-disables internal analytics via analytics.disable(), and the eval runner now forwards this to wizard child runs to prevent synthetic eval traffic from reaching production telemetry.

^{Reviewed by Cursor Bugbot for commit 7734647. Bugbot is set up for automated code reviews on this repo. Configure here.}

Pulls in the open scenario PRs (rr7-data, react-vite) and adds the remaining three Ring 1 scenarios per docs/evals.md: * nextjs-app-router/pre-existing-vendor — exercises augment-don't-replace with a stale @amplitude/analytics-browser install, golden shows clean migration to @amplitude/unified * expo/vanilla — Expo Router app, EXPO_PUBLIC_ env prefix, Expo SDK web export as the build command (no Xcode/Android tooling needed) * generic/probe — stripped vanilla HTML/JS app with framework markers removed; useDetection=false so we grade the wizard's detection pipeline against `generic` as ground truth Plus supporting work: * Layer 0 correct-sdk-package now hard-fails on stale-legacy-SDK coexistence (failure mode #10: both @amplitude/unified and @amplitude/analytics-browser surviving a pre-existing-vendor run) * Framework→SDK table picks up `react-router` and `expo` aliases * Layer 1 idempotent-rerun scorer (criterion 16) — opt-in via Artifact.secondRunLog; skip-passes with weight 0 when absent * Runner test parameterized over all seven Ring 1 goldens * PR-gate matrix lists every Ring 1 scenario Verified: pnpm exec vitest run evals/ — 94/94 across 13 files.

* Register L1-idempotent-rerun (criterion 16) in score.ts. Replays that pin a `golden/run-second.ndjson` get full coverage; absence skip-passes with weight 0. * Add L1-self-verification-passes (criterion 17). Reads the `verification_result` events from the run log and asserts every phase succeeded. Maps failure reasons into the scorer's detail so triagers see the offending phase without opening the artifact. * Plumb optional secondRunLog through replay loading. * Unit tests for both scorers (10 new tests, 100 total). Verified: pnpm exec vitest run evals/ — 100/100 across 14 files.

Layer 4 boots the post-wizard working tree in a headless browser, loads a known route, and grades: * pageStatusCode (2xx/3xx required) * uncaught console errors (any → fail with the first message) * outbound Amplitude requests (≥1 required — proves the SDK actually ran instead of being a tree-shaken no-op import) Implementation notes: * `evals/runner/runtime.ts` — runRuntimeProbe() spawns the dev server, waits for the port, dynamic-imports playwright, navigates, and intercepts Amplitude requests so they never forward. * Dynamic playwright import means the framework compiles + tests cleanly without the dependency. Nightly runners install playwright explicitly; PR-gate runs skip-pass with a clear "playwright not installed" detail. * `runtimeProbe` field added to scenario.json schema (devCommand, route, port, optional bootTimeoutMs). * `--runtime` flag on bin/run-eval.ts opts in. * golden/runtime-result.json pins recorded outcomes for replay coverage. Verified: pnpm exec vitest run evals/ — 110/110 across 16 files.

Layer 6 hands a sealed view of the post-wizard diff + setup_complete + confirmed events to a Claude Sonnet judge prompted with the 19-point rubric (versioned via evals/rubrics/rubric-version.txt). The judge returns one structured-JSON verdict per criterion it grades, each required to carry an evidence_path + line — uncited verdicts are treated as flakes and dropped. Implementation: * evals/rubrics/judge-prompt.md — canonical system prompt; spells out the output JSON contract, what to grade (criteria 7, 15, 19 plus free-form taste), and what to leave to deterministic layers. * evals/rubrics/rubric-version.txt — bumped on every rubric change so score drift correlates with rubric changes. * evals/runner/judge.ts — runJudge() loads the rubric, builds the user message (scenario + diff + setup_complete + events), calls @anthropic-ai/sdk, validates the response against Zod schemas, returns a JudgeResult. * evals/scorers/layer6-judge/judge-verdict.ts — aggregates verdicts into one ScorerResult; weight is the sum of per-criterion weights; failure detail names the offending criteria with citations. * --judge flag on bin/run-eval.ts opts in. * Optional golden/judge-result.json for replay coverage. Auth surface: ANTHROPIC_API_KEY today; gateway routing via the wizard's own OAuth path is the next step. Verified: 122/122 tests across 18 files; lint clean.

Adds the spec's nightly CI gate plus the harness that catches non-deterministic scenarios across seeds. * .github/workflows/evals-nightly.yml — daily 07:00 UTC cron + manual dispatch, full Ring 1+2 matrix × 2 seeds, installs playwright for the Layer 4 probe, runs --judge for Layer 6, uploads per-seed reports plus a variance summary. * evals/bin/variance-summary.ts — scans all report.json files, groups by scenario, computes seed-to-seed score spread, flags any scenario with spread > 10 (the spec's non-determinism threshold) and exits non-zero so CI surfaces the issue. * --seed flag on evals:run plus EVAL_SEED env var passthrough so the nightly matrix can pin a per-job seed onto the artifact and the variance summary can stitch reports together. * Unit test for the variance summary (passes within threshold, flags + exits non-zero when spread > 10). Verified: 124/124 tests across 19 files; lint clean.

Spell out the layer status, document Layer 5 as the only deferred piece (blocked on the eval-only Amplitude project — open decision #2), and refresh the quick-start commands now that --runtime / --judge / --seed are wired. Verified: pnpm test — 3573/3573 across 245 test files; lint clean.

Adds a hard kill-switch on the wizard's internal analytics so eval synthetic runs never reach the prod analytics project. With this in place, the eval suite no longer needs an "eval-only Amplitude project" (decision #2 in the spec) — synthetic runs are silent. * bin.ts — new --no-telemetry flag, also honored via AMPLITUDE_WIZARD_NO_TELEMETRY=1. * src/utils/analytics.ts — Analytics.disable() / isDisabled(). capture() and identifyUser() short-circuit before the SDK is touched (no init, no track, no network). Idempotent. * src/commands/default.ts — applies disable() before any other code path so the very first event ('session started') short-circuits inside the singleton. * evals/runner/invoke-wizard.ts — runLive() forwards both the flag and the env var to the wizard child, belt-and-braces. Layer 5 (ingestion verification) is now formally out of scope for the eval suite — owned by e2e-tests/. The README + spec document the boundary: Layer 4 covers "did the SDK fire ≥1 outbound request"; the ingestion question (does the request reach Amplitude under real network/hosting/bundling shape) belongs to e2e tests, not regression evals on prompt/skill changes. Test coverage: 4 new tests on Analytics.disable() + isDisabled() (track/identify short-circuit when disabled). Total 3577/3577 across 245 test files; lint clean.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix prepared fixes for both issues found in the latest run.

✅ Fixed: Nightly workflow artifact names contain invalid slash characters
- Added a sanitization step to replace forward slashes with hyphens in scenario names before using them in artifact names.
✅ Fixed: Runtime probe reports ok=true for 5xx status codes
- Changed the ok flag condition to require status codes in the 2xx-3xx range (>= 200 && < 400) instead of just > 0.

Or push these changes by commenting:

@cursor push 7666572eeb

Preview (7666572eeb)

diff --git a/.github/workflows/evals-nightly.yml b/.github/workflows/evals-nightly.yml
--- a/.github/workflows/evals-nightly.yml
+++ b/.github/workflows/evals-nightly.yml
@@ -81,11 +81,15 @@
           EVAL_SEED: ${{ matrix.seed }}
         run: pnpm evals:run ${{ matrix.scenario }} --runtime --judge
 
+      - name: Sanitize scenario name
+        id: sanitize
+        run: echo "name=$(echo '${{ matrix.scenario }}' | tr '/' '-')" >> $GITHUB_OUTPUT
+
       - name: Upload report
         if: always()
         uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
         with:
-          name: eval-report-${{ matrix.scenario }}-seed${{ matrix.seed }}
+          name: eval-report-${{ steps.sanitize.outputs.name }}-seed${{ matrix.seed }}
           path: evals/reports/**/report.json
           retention-days: 30
           if-no-files-found: ignore

diff --git a/evals/runner/runtime.ts b/evals/runner/runtime.ts
--- a/evals/runner/runtime.ts
+++ b/evals/runner/runtime.ts
@@ -253,7 +253,10 @@
         waitUntil: 'networkidle',
       });
       pageStatusCode = response?.status() ?? 0;
-      ok = consoleErrors.length === 0 && pageStatusCode > 0;
+      ok =
+        consoleErrors.length === 0 &&
+        pageStatusCode >= 200 &&
+        pageStatusCode < 400;
     } finally {
       await browser.close();
     }

_{You can send follow-ups to the cloud agent here.}

^{Reviewed by Cursor Bugbot for commit 7734647. Configure here.}

cursor · 2026-05-08T21:21:28Z

+        if: always()
+        uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
+        with:
+          name: eval-report-${{ matrix.scenario }}-seed${{ matrix.seed }}


Nightly workflow artifact names contain invalid slash characters

High Severity

The artifact name eval-report-${{ matrix.scenario }}-seed${{ matrix.seed }} will contain forward slashes from scenario names like nextjs-app-router/vanilla, producing names like eval-report-nextjs-app-router/vanilla-seed1. GitHub Actions upload-artifact v4 rejects artifact names containing / characters, causing every upload step in the nightly matrix to fail. The download-artifact step in the variance-summary job would then find no reports.

^{Reviewed by Cursor Bugbot for commit 7734647. Configure here.}

cursor · 2026-05-08T21:21:29Z

+        waitUntil: 'networkidle',
+      });
+      pageStatusCode = response?.status() ?? 0;
+      ok = consoleErrors.length === 0 && pageStatusCode > 0;


Runtime probe reports ok=true for 5xx status codes

Low Severity

The ok flag is computed as consoleErrors.length === 0 && pageStatusCode > 0, meaning a 500 response with no console errors sets ok: true. The RuntimeResult.ok docstring says "True when the probe booted, navigated, and exited cleanly," but a 5xx is not a clean exit. The L4 scorer happens to catch this via a separate status-code check, but the ok field on pinned golden/runtime-result.json files could mislead future scorers or tooling that trusts it at face value.

^{Reviewed by Cursor Bugbot for commit 7734647. Configure here.}

kelsonpw

Solid Phase-1 landing — Ring 1 scenarios, scorers, judge prompt, and the variance harness all hang together cleanly. Every Ring 1 CI check is green at HEAD (incl. all 7 scenarios, scorer vitest, build, lint), and the test coverage on the new scorers/runner is appropriate for a contained eval subsystem.

One blocker before merge:

.github/workflows/evals-nightly.yml:88 — the Bugbot HIGH finding is real and unaddressed at HEAD (7734647). Artifact name eval-report-${{ matrix.scenario }}-seed${{ matrix.seed }} interpolates scenario IDs like nextjs-app-router/vanilla, and actions/upload-artifact@v4 rejects names containing /. Every nightly upload step will fail and the download-artifact consumer in variance-summary will get nothing. Bugbot's autofix (sanitize step that tr '/' '-') is the right shape — pull it in or apply equivalent. Same fix should be considered for the artifact name in any future workflow that interpolates scenario paths.

Non-blocking notes:

evals/runner/runtime.ts:256 — ok = consoleErrors.length === 0 && pageStatusCode > 0 returns true for 5xx pages with no console errors. The L4 scorer catches it via a separate status check, but the field's docstring promises a "clean exit." Worth tightening to pageStatusCode >= 200 && pageStatusCode < 400 so pinned golden/runtime-result.json files reflect reality. Could be follow-up.
109 files / ~3.7k LOC is large but the bulk is fixtures and goldens, which is what you want for a scenario suite. Reviewable.

Re-request review after the artifact-name fix lands and I'll approve.

kelsonpw

Heads up — this is ~3.7k LOC across 105 files in one PR. That's a lot to review with confidence in a single pass, and I'd hate for substantive issues to slip past because reviewers run out of steam.

Looking at the commit list, the natural seams are already pretty clean:

Ring 1 scenarios + Layer 0 stale-legacy hard-fail (commit cf8554dc) — fixtures + goldens + one scorer tweak
Layer 1 idempotent-rerun + self-verification scorers (commit c2328b20)
Layer 4 runtime probe (commit 6359cae5) — Playwright dynamic-import + scenario schema bump
Layer 6 LLM judge (commit 2301dd4b) — rubric + Anthropic SDK call + judge scorer
Nightly workflow + variance harness (commit 877c2586)
--no-telemetry flag + Layer 5 deferral (commit 7734647c) — touches bin.ts, analytics.ts, commands/default.ts (orthogonal to evals)

Would it be possible to land this as ~3 sequential PRs? E.g.:

#670a — Ring 1 scenarios + Layer 1 scorers (fixtures + deterministic scoring, no new external deps)
#670b — Layer 4 + Layer 6 + nightly workflow (Playwright + judge + variance harness — the new external surfaces)
#670c — --no-telemetry (the only commit that touches wizard runtime code outside evals/; reviewable independently by the analytics owners)

Not blocking — happy to power through if it has to ship as one unit, especially given the spec calls Phase 1 a single milestone — but smaller PRs would land faster and get higher-confidence reviews. Calling it out.

kaiapeacock-eng added 7 commits May 8, 2026 12:53

kaiapeacock-eng requested a review from a team as a code owner May 8, 2026 21:19

cursor Bot reviewed May 8, 2026

View reviewed changes

This was referenced May 8, 2026

feat(evals): Ring 1 scenario — react-router-7/data #575

Closed

feat(evals): Ring 1 scenario — react-vite/vanilla #576

Closed

kelsonpw requested changes May 8, 2026

View reviewed changes

kelsonpw reviewed May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(evals): finish Phase 1 — Ring 1 complete, Layers 0-6 wired, --no-telemetry#670

feat(evals): finish Phase 1 — Ring 1 complete, Layers 0-6 wired, --no-telemetry#670
kaiapeacock-eng wants to merge 7 commits into
mainfrom
ba-104-finish-eval-plan-v2

kaiapeacock-eng commented May 8, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

cursor Bot May 8, 2026

Uh oh!

cursor Bot May 8, 2026

Uh oh!

kelsonpw left a comment

Uh oh!

kelsonpw left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kaiapeacock-eng commented May 8, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in this PR

Ring 1 scenarios — 7 of 7 shipped

New scorers

Layer 6 LLM judge

CI workflows

Variance harness

--no-telemetry (resolves spec decision #2)

Layer 5 — formally out of scope

Value the PR delivers today

Required follow-up: wizard-side OAuth wiring

Other follow-ups (unblocked, not gating)

Test plan

Known caveats

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 8, 2026

Choose a reason for hiding this comment

Nightly workflow artifact names contain invalid slash characters

Uh oh!

cursor Bot May 8, 2026

Choose a reason for hiding this comment

Runtime probe reports ok=true for 5xx status codes

Uh oh!

kelsonpw left a comment

Choose a reason for hiding this comment

Uh oh!

kelsonpw left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kaiapeacock-eng commented May 8, 2026 •

edited by cursor Bot

Loading

cursor Bot left a comment •

edited

Loading