Skip to content

feat(evals): finish Phase 1 — Ring 1 complete, Layers 0-6 wired, --no-telemetry#670

Open
kaiapeacock-eng wants to merge 7 commits into
mainfrom
ba-104-finish-eval-plan-v2
Open

feat(evals): finish Phase 1 — Ring 1 complete, Layers 0-6 wired, --no-telemetry#670
kaiapeacock-eng wants to merge 7 commits into
mainfrom
ba-104-finish-eval-plan-v2

Conversation

@kaiapeacock-eng
Copy link
Copy Markdown
Collaborator

@kaiapeacock-eng kaiapeacock-eng commented May 8, 2026

Summary

Finishes Phase 1 of the SDK-integration eval suite per docs/evals.md. The PR-gate ships value today; the nightly + variance harness lights up on the next 07:00 UTC cron; Layer 5 (ingestion) is now formally out of scope and owned by e2e-tests/.

7 commits, 105 files changed, ~3.7k LOC added.

What's in this PR

Ring 1 scenarios — 7 of 7 shipped

Each scenario has a lockfile-pinned pristine fixture, a hand-authored golden, and lands a passing 90/90 score on the existing scorer stack.

New scorers

  • L1-idempotent-rerun (criterion 16) — opt-in via Artifact.secondRunLog; replays can pin golden/run-second.ndjson
  • L1-self-verification-passes (criterion 17) — reads verification_result events, fails if any phase reports success=false
  • L4-runtime-probe — Playwright dynamic-import gated. Boots the dev server, navigates a known route, intercepts Amplitude requests (does not forward — Layer 5 is out of scope), grades pageStatusCode + console errors + ≥1 SDK request fired
  • L6-judge-verdict — aggregates structured verdicts from a Sonnet judge prompted with a versioned rubric

Layer 6 LLM judge

  • evals/rubrics/judge-prompt.md — canonical system prompt
  • evals/rubrics/rubric-version.txt — bumped on every rubric change so score drift correlates with rubric changes
  • evals/runner/judge.ts — calls Anthropic via @anthropic-ai/sdk, validates response with Zod, drops uncited verdicts as flakes
  • Citations required: every verdict must carry evidence_path + evidence_line_start

CI workflows

  • .github/workflows/evals-pr-scenarios.yml — Ring 1 matrix gets all 7 scenarios (was 1)
  • .github/workflows/evals-nightly.yml — daily 07:00 UTC + manual dispatch, full Ring 1+2 × 2 seeds, installs Playwright for Layer 4, runs --judge for Layer 6, uploads per-seed reports + a separate variance-summary artifact

Variance harness

  • evals/bin/variance-summary.ts — scans every report.json under the artifact-download root, groups by scenario, computes seed-to-seed score spread, flags any scenario > 10 points (the spec's non-determinism threshold) and exits non-zero

--no-telemetry (resolves spec decision #2)

  • bin.ts flag + AMPLITUDE_WIZARD_NO_TELEMETRY=1 env var
  • Analytics.disable() / isDisabled() — hard switch that short-circuits both capture() and identifyUser() before the SDK touches anything
  • Eval runner forwards both flag and env var to wizard child (belt-and-braces)
  • This resolves the long-open decision Bump minimatch and eslint #2 (eval-only Amplitude project): synthetic eval runs are now silent, so there's nothing to isolate

Layer 5 — formally out of scope

End-to-end ingestion verification is now owned by e2e-tests/test-applications/, not the eval suite. Layer 4 covers "did the SDK fire ≥1 outbound request"; whether that request reaches Amplitude under real network/hosting/bundling shape belongs to e2e tests, not regression evals on prompt/skill changes. README + spec document the boundary.

Value the PR delivers today

The PR gate runs Layers 0–3 against 7 Ring 1 goldens, wall-clock <8 min, and catches:

Hard fails

  • Hardcoded API key in any agent-written file
  • Wrong SDK family for the framework
  • Stale-legacy-SDK coexistence after a pre-existing-vendor run (new in this PR)
  • Multiple init() calls in the same project
  • Build-config bridging (next.config.* / vite.config.* / etc.)

Heavy fails

  • Confirmed event missing a track() call
  • Project-local lib/amplitude.ts re-export wrapper
  • Server/client boundary breakage
  • Init in the wrong entry file for the framework
  • Build/typecheck broken by the wizard's diff (L3, when scenario pins BuildResult)

Medium fails

  • Wrong env var prefix
  • setup_complete event lying about which files were written
  • Init options without comments
  • Wrong package version range / surprise vendor packages

Warns

  • Property-key naming drift off lowercase-with-spaces

The 3 negative-control tests (added in #577) are the existence proof — they introduce these regressions into a passing scenario and confirm the right scorer at the right layer fires.

Coverage scope: the gate fires on PRs touching commandments.ts, agent-runner.ts, agent-interface.ts, wizard-tools.ts, agent-events.ts, exit-codes.ts, src/frameworks/**, src/ui/agent-ui.ts, skills/integration/**, or evals/**. So every prompt change, skill change, framework config change, or wire-format change gets graded against 7 reference integrations before it can land.

Required follow-up: wizard-side OAuth wiring

This PR does not enable live runs. The eval runner already forwards WIZARD_OAUTH_TOKEN / WIZARD_EXPIRES_AT / WIZARD_ZONE to the wizard child process, but the wizard itself doesn't read those env vars yet. Until that wiring lands, the only options for live runs are:

  1. EVALS_ALLOW_API_KEY_BYPASS=1 + --api-key — opt-in fallback. Routes LLM calls direct-to-Anthropic, skipping the Amplitude LLM gateway. This cannot catch gateway-specific bugs (Vertex schema noise, beta-header rejections, the proxy 400 class). The runner prints a warning when this mode is used.
  2. Wait for the wizard-side wiring follow-up. Once WIZARD_OAUTH_TOKEN is read on the wizard side, gateway-routed live evals work in CI without the bypass.

The follow-up PR needs to:

  • Read WIZARD_OAUTH_TOKEN at wizard startup as an alternative to the OAuth browser flow
  • Validate WIZARD_EXPIRES_AT (skip refresh if still valid; refresh otherwise)
  • Honor WIZARD_ZONE for region routing
  • Add a unit test that confirms the env-var path produces the same auth state as the OAuth path

Until that lands:

  • PR-gate runs on goldens (deterministic, no auth)
  • Nightly runs on goldens (deterministic, no auth)
  • Layer 6 judge reads ANTHROPIC_API_KEY directly (not gateway-routed yet — that's a separate but related follow-up)

Other follow-ups (unblocked, not gating)

  • Judge gateway routing. L6 currently calls @anthropic-ai/sdk directly. Routing through the wizard's LLM gateway would share the same auth + rate-limit story.
  • AST-based init counting. Layer 0's single-init-call.ts uses regex; can false-positive on commented-out callsites. AST move bumps it from L0 to L2.
  • Ring 2 expansion. 7 Ring 1 scenarios shipped; the Integration enum has 18 entries and skills/integration/ has ~32. The skill/detector drift sidebar in docs/evals.md lays out the gap.
  • Live runs in CI. Once OAuth wiring is in place, swap goldens for live runs on the nightly (PR gate stays on goldens for speed).

Test plan

  • pnpm exec vitest run evals/ — 124/124 across 19 files
  • pnpm test (full suite) — 3577/3577 across 245 files
  • pnpm lint — clean (1 pre-existing warning, unrelated)
  • pnpm evals:run nextjs-app-router/vanilla — 90/90, contractOk
  • pnpm evals:run nextjs-app-router/pre-existing-vendor — 90/90, contractOk
  • pnpm evals:run react-router-7/framework — 90/90, contractOk
  • pnpm evals:run react-router-7/data — 90/90, contractOk
  • pnpm evals:run react-vite/vanilla — 90/90, contractOk
  • pnpm evals:run expo/vanilla — 90/90, contractOk
  • pnpm evals:run generic/probe — 90/90, contractOk
  • First nightly run completes cleanly — verifiable only after merge + 07:00 UTC cron
  • Variance harness flags zero scenarios on the first nightly — verifiable only after merge

Known caveats

  • First nightly is calibration. Layers 4 + 6 are wired and unit-tested but never exercised end-to-end against actual Playwright / actual Anthropic. The first 1–2 weeks of nightly runs are where we'll find out if Playwright reliably boots every framework's dev server in CI, and if Sonnet judge variance stays under 10 points across seeds.
  • Semgrep flags analytics.ts lines 20–21. Those are pre-existing public Amplitude write keys (mirror Lightning's ampli config — designed to ship in source like a Stripe publishable key). Not changes introduced by this PR. Suppressing the alert with // nosemgrep on those lines is the right follow-up but out of scope here.

🤖 Generated with Claude Code


Note

Medium Risk
Adds new CI workflows plus optional Playwright runtime probing and Anthropic judge calls; misconfiguration could increase CI flakiness/cost or change telemetry behavior via the new global disable switch.

Overview
Completes Phase 1 of the eval suite by expanding Ring 1 coverage to 7 scenarios and wiring additional scoring layers, including optional Layer 4 runtime probing (Playwright headless dev-server boot + intercepting Amplitude requests) and optional Layer 6 LLM judging (Anthropic call with versioned rubric + schema-validated structured verdicts).

Introduces a nightly scheduled workflow that runs the full scenario matrix across two seeds with --runtime/--judge, uploads per-run reports, and produces a variance summary artifact that flags scenarios whose seed-to-seed score spread exceeds 10 (and exits non-zero when flagged). PR gating is updated to run the full Ring 1 matrix.

Adds a --no-telemetry CLI flag (and AMPLITUDE_WIZARD_NO_TELEMETRY=1 support) that hard-disables internal analytics via analytics.disable(), and the eval runner now forwards this to wizard child runs to prevent synthetic eval traffic from reaching production telemetry.

Reviewed by Cursor Bugbot for commit 7734647. Bugbot is set up for automated code reviews on this repo. Configure here.

Pulls in the open scenario PRs (rr7-data, react-vite) and adds the
remaining three Ring 1 scenarios per docs/evals.md:
  * nextjs-app-router/pre-existing-vendor — exercises augment-don't-replace
    with a stale @amplitude/analytics-browser install, golden shows clean
    migration to @amplitude/unified
  * expo/vanilla — Expo Router app, EXPO_PUBLIC_ env prefix, Expo SDK
    web export as the build command (no Xcode/Android tooling needed)
  * generic/probe — stripped vanilla HTML/JS app with framework markers
    removed; useDetection=false so we grade the wizard's detection
    pipeline against `generic` as ground truth

Plus supporting work:
  * Layer 0 correct-sdk-package now hard-fails on stale-legacy-SDK
    coexistence (failure mode #10: both @amplitude/unified and
    @amplitude/analytics-browser surviving a pre-existing-vendor run)
  * Framework→SDK table picks up `react-router` and `expo` aliases
  * Layer 1 idempotent-rerun scorer (criterion 16) — opt-in via
    Artifact.secondRunLog; skip-passes with weight 0 when absent
  * Runner test parameterized over all seven Ring 1 goldens
  * PR-gate matrix lists every Ring 1 scenario

Verified: pnpm exec vitest run evals/ — 94/94 across 13 files.
  * Register L1-idempotent-rerun (criterion 16) in score.ts. Replays
    that pin a `golden/run-second.ndjson` get full coverage; absence
    skip-passes with weight 0.
  * Add L1-self-verification-passes (criterion 17). Reads the
    `verification_result` events from the run log and asserts every
    phase succeeded. Maps failure reasons into the scorer's detail so
    triagers see the offending phase without opening the artifact.
  * Plumb optional secondRunLog through replay loading.
  * Unit tests for both scorers (10 new tests, 100 total).

Verified: pnpm exec vitest run evals/ — 100/100 across 14 files.
Layer 4 boots the post-wizard working tree in a headless browser,
loads a known route, and grades:
  * pageStatusCode (2xx/3xx required)
  * uncaught console errors (any → fail with the first message)
  * outbound Amplitude requests (≥1 required — proves the SDK actually
    ran instead of being a tree-shaken no-op import)

Implementation notes:
  * `evals/runner/runtime.ts` — runRuntimeProbe() spawns the dev
    server, waits for the port, dynamic-imports playwright, navigates,
    and intercepts Amplitude requests so they never forward.
  * Dynamic playwright import means the framework compiles + tests
    cleanly without the dependency. Nightly runners install playwright
    explicitly; PR-gate runs skip-pass with a clear "playwright not
    installed" detail.
  * `runtimeProbe` field added to scenario.json schema (devCommand,
    route, port, optional bootTimeoutMs).
  * `--runtime` flag on bin/run-eval.ts opts in.
  * golden/runtime-result.json pins recorded outcomes for replay
    coverage.

Verified: pnpm exec vitest run evals/ — 110/110 across 16 files.
Layer 6 hands a sealed view of the post-wizard diff + setup_complete
+ confirmed events to a Claude Sonnet judge prompted with the 19-point
rubric (versioned via evals/rubrics/rubric-version.txt). The judge
returns one structured-JSON verdict per criterion it grades, each
required to carry an evidence_path + line — uncited verdicts are
treated as flakes and dropped.

Implementation:
  * evals/rubrics/judge-prompt.md — canonical system prompt; spells
    out the output JSON contract, what to grade (criteria 7, 15, 19
    plus free-form taste), and what to leave to deterministic layers.
  * evals/rubrics/rubric-version.txt — bumped on every rubric change
    so score drift correlates with rubric changes.
  * evals/runner/judge.ts — runJudge() loads the rubric, builds the
    user message (scenario + diff + setup_complete + events), calls
    @anthropic-ai/sdk, validates the response against Zod schemas,
    returns a JudgeResult.
  * evals/scorers/layer6-judge/judge-verdict.ts — aggregates verdicts
    into one ScorerResult; weight is the sum of per-criterion weights;
    failure detail names the offending criteria with citations.
  * --judge flag on bin/run-eval.ts opts in.
  * Optional golden/judge-result.json for replay coverage.

Auth surface: ANTHROPIC_API_KEY today; gateway routing via the
wizard's own OAuth path is the next step.

Verified: 122/122 tests across 18 files; lint clean.
Adds the spec's nightly CI gate plus the harness that catches
non-deterministic scenarios across seeds.

  * .github/workflows/evals-nightly.yml — daily 07:00 UTC cron + manual
    dispatch, full Ring 1+2 matrix × 2 seeds, installs playwright for
    the Layer 4 probe, runs --judge for Layer 6, uploads per-seed
    reports plus a variance summary.
  * evals/bin/variance-summary.ts — scans all report.json files,
    groups by scenario, computes seed-to-seed score spread, flags any
    scenario with spread > 10 (the spec's non-determinism threshold)
    and exits non-zero so CI surfaces the issue.
  * --seed flag on evals:run plus EVAL_SEED env var passthrough so the
    nightly matrix can pin a per-job seed onto the artifact and the
    variance summary can stitch reports together.
  * Unit test for the variance summary (passes within threshold,
    flags + exits non-zero when spread > 10).

Verified: 124/124 tests across 19 files; lint clean.
Spell out the layer status, document Layer 5 as the only deferred
piece (blocked on the eval-only Amplitude project — open decision #2),
and refresh the quick-start commands now that --runtime / --judge /
--seed are wired.

Verified: pnpm test — 3573/3573 across 245 test files; lint clean.
Adds a hard kill-switch on the wizard's internal analytics so eval
synthetic runs never reach the prod analytics project. With this in
place, the eval suite no longer needs an "eval-only Amplitude
project" (decision #2 in the spec) — synthetic runs are silent.

  * bin.ts — new --no-telemetry flag, also honored via
    AMPLITUDE_WIZARD_NO_TELEMETRY=1.
  * src/utils/analytics.ts — Analytics.disable() / isDisabled().
    capture() and identifyUser() short-circuit before the SDK is
    touched (no init, no track, no network). Idempotent.
  * src/commands/default.ts — applies disable() before any other code
    path so the very first event ('session started') short-circuits
    inside the singleton.
  * evals/runner/invoke-wizard.ts — runLive() forwards both the flag
    and the env var to the wizard child, belt-and-braces.

Layer 5 (ingestion verification) is now formally out of scope for
the eval suite — owned by e2e-tests/. The README + spec document the
boundary: Layer 4 covers "did the SDK fire ≥1 outbound request"; the
ingestion question (does the request reach Amplitude under real
network/hosting/bundling shape) belongs to e2e tests, not regression
evals on prompt/skill changes.

Test coverage: 4 new tests on Analytics.disable() + isDisabled()
(track/identify short-circuit when disabled). Total 3577/3577 across
245 test files; lint clean.
@kaiapeacock-eng kaiapeacock-eng requested a review from a team as a code owner May 8, 2026 21:19
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Nightly workflow artifact names contain invalid slash characters
    • Added a sanitization step to replace forward slashes with hyphens in scenario names before using them in artifact names.
  • ✅ Fixed: Runtime probe reports ok=true for 5xx status codes
    • Changed the ok flag condition to require status codes in the 2xx-3xx range (>= 200 && < 400) instead of just > 0.

Create PR

Or push these changes by commenting:

@cursor push 7666572eeb
Preview (7666572eeb)
diff --git a/.github/workflows/evals-nightly.yml b/.github/workflows/evals-nightly.yml
--- a/.github/workflows/evals-nightly.yml
+++ b/.github/workflows/evals-nightly.yml
@@ -81,11 +81,15 @@
           EVAL_SEED: ${{ matrix.seed }}
         run: pnpm evals:run ${{ matrix.scenario }} --runtime --judge
 
+      - name: Sanitize scenario name
+        id: sanitize
+        run: echo "name=$(echo '${{ matrix.scenario }}' | tr '/' '-')" >> $GITHUB_OUTPUT
+
       - name: Upload report
         if: always()
         uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
         with:
-          name: eval-report-${{ matrix.scenario }}-seed${{ matrix.seed }}
+          name: eval-report-${{ steps.sanitize.outputs.name }}-seed${{ matrix.seed }}
           path: evals/reports/**/report.json
           retention-days: 30
           if-no-files-found: ignore

diff --git a/evals/runner/runtime.ts b/evals/runner/runtime.ts
--- a/evals/runner/runtime.ts
+++ b/evals/runner/runtime.ts
@@ -253,7 +253,10 @@
         waitUntil: 'networkidle',
       });
       pageStatusCode = response?.status() ?? 0;
-      ok = consoleErrors.length === 0 && pageStatusCode > 0;
+      ok =
+        consoleErrors.length === 0 &&
+        pageStatusCode >= 200 &&
+        pageStatusCode < 400;
     } finally {
       await browser.close();
     }

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit 7734647. Configure here.

if: always()
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
with:
name: eval-report-${{ matrix.scenario }}-seed${{ matrix.seed }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nightly workflow artifact names contain invalid slash characters

High Severity

The artifact name eval-report-${{ matrix.scenario }}-seed${{ matrix.seed }} will contain forward slashes from scenario names like nextjs-app-router/vanilla, producing names like eval-report-nextjs-app-router/vanilla-seed1. GitHub Actions upload-artifact v4 rejects artifact names containing / characters, causing every upload step in the nightly matrix to fail. The download-artifact step in the variance-summary job would then find no reports.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7734647. Configure here.

Comment thread evals/runner/runtime.ts
waitUntil: 'networkidle',
});
pageStatusCode = response?.status() ?? 0;
ok = consoleErrors.length === 0 && pageStatusCode > 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtime probe reports ok=true for 5xx status codes

Low Severity

The ok flag is computed as consoleErrors.length === 0 && pageStatusCode > 0, meaning a 500 response with no console errors sets ok: true. The RuntimeResult.ok docstring says "True when the probe booted, navigated, and exited cleanly," but a 5xx is not a clean exit. The L4 scorer happens to catch this via a separate status-code check, but the ok field on pinned golden/runtime-result.json files could mislead future scorers or tooling that trusts it at face value.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7734647. Configure here.

Copy link
Copy Markdown
Member

@kelsonpw kelsonpw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid Phase-1 landing — Ring 1 scenarios, scorers, judge prompt, and the variance harness all hang together cleanly. Every Ring 1 CI check is green at HEAD (incl. all 7 scenarios, scorer vitest, build, lint), and the test coverage on the new scorers/runner is appropriate for a contained eval subsystem.

One blocker before merge:

.github/workflows/evals-nightly.yml:88 — the Bugbot HIGH finding is real and unaddressed at HEAD (7734647). Artifact name eval-report-${{ matrix.scenario }}-seed${{ matrix.seed }} interpolates scenario IDs like nextjs-app-router/vanilla, and actions/upload-artifact@v4 rejects names containing /. Every nightly upload step will fail and the download-artifact consumer in variance-summary will get nothing. Bugbot's autofix (sanitize step that tr '/' '-') is the right shape — pull it in or apply equivalent. Same fix should be considered for the artifact name in any future workflow that interpolates scenario paths.

Non-blocking notes:

  • evals/runner/runtime.ts:256ok = consoleErrors.length === 0 && pageStatusCode > 0 returns true for 5xx pages with no console errors. The L4 scorer catches it via a separate status check, but the field's docstring promises a "clean exit." Worth tightening to pageStatusCode >= 200 && pageStatusCode < 400 so pinned golden/runtime-result.json files reflect reality. Could be follow-up.
  • 109 files / ~3.7k LOC is large but the bulk is fixtures and goldens, which is what you want for a scenario suite. Reviewable.

Re-request review after the artifact-name fix lands and I'll approve.

Copy link
Copy Markdown
Member

@kelsonpw kelsonpw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up — this is ~3.7k LOC across 105 files in one PR. That's a lot to review with confidence in a single pass, and I'd hate for substantive issues to slip past because reviewers run out of steam.

Looking at the commit list, the natural seams are already pretty clean:

  • Ring 1 scenarios + Layer 0 stale-legacy hard-fail (commit cf8554dc) — fixtures + goldens + one scorer tweak
  • Layer 1 idempotent-rerun + self-verification scorers (commit c2328b20)
  • Layer 4 runtime probe (commit 6359cae5) — Playwright dynamic-import + scenario schema bump
  • Layer 6 LLM judge (commit 2301dd4b) — rubric + Anthropic SDK call + judge scorer
  • Nightly workflow + variance harness (commit 877c2586)
  • --no-telemetry flag + Layer 5 deferral (commit 7734647c) — touches bin.ts, analytics.ts, commands/default.ts (orthogonal to evals)

Would it be possible to land this as ~3 sequential PRs? E.g.:

  • #670a — Ring 1 scenarios + Layer 1 scorers (fixtures + deterministic scoring, no new external deps)
  • #670b — Layer 4 + Layer 6 + nightly workflow (Playwright + judge + variance harness — the new external surfaces)
  • #670c--no-telemetry (the only commit that touches wizard runtime code outside evals/; reviewable independently by the analytics owners)

Not blocking — happy to power through if it has to ship as one unit, especially given the spec calls Phase 1 a single milestone — but smaller PRs would land faster and get higher-confidence reviews. Calling it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants