ivanmkc · Jun 7, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -39,7 +39,7 @@ Supported subset: `graph TD|LR` nodes `A[x]` `B{decision}` `C[(db)]` edges `-->`
 Push native JSON to a running viewer — run it locally
 (`PUSH_TOKEN=dev PORT=8080 node packages/viewer/dist/server.js`, open `http://localhost:8080/w/me/`;
 the viewer ships in this repo, not the npm CLI) or point at a deployed one. Configured by two
-env vars — **if either is unset, `push`/`status` exit 3** with `Set TERMCHART_VIEWER_URL and TERMCHART_VIEWER_TOKEN.`:
+env vars — **if either is unset, `push`/`status` exit 4** with `…are not set: no termchart viewer configured.`:
 
 ```bash
 export TERMCHART_VIEWER_URL="http://localhost:8080/w/me"

diff --git a/docs/plans/2026-06-07-latency-token-experimentation-plan.md b/docs/plans/2026-06-07-latency-token-experimentation-plan.md
diff --git a/docs/superpowers/specs/2026-06-09-bench-judge-edit-flow-design.md b/docs/superpowers/specs/2026-06-09-bench-judge-edit-flow-design.md
@@ -0,0 +1,122 @@
+# Bench harness P1: correctness/fidelity judge + edit flow
+
+**Branch:** `perf/bench-p1` (stacked on `perf/reduce-latency` / PR #143) · **Date:** 2026-06-09 · **Status:** draft
+
+Extends the latency/token experiment harness (`scripts/experiments/`, plan
+`docs/plans/2026-06-07-latency-token-experimentation-plan.md`) with the two pieces
+P1 explicitly defers: a **real correctness gate + fidelity judge**, and an
+**edit (two-turn) flow**. These are the "output fidelity/correctness/usability"
+and "first-generation **and** edits" axes of the climb.
+
+## 1. Why
+
+Today a cell counts as **success** iff the runner exited 0 and made ≥1 LLM call
+(`cell_record.py:53`). Task `rubric`s are loaded but unused. Consequences:
+- Token/latency deltas are untrustworthy — a fix can "win" by emitting junk cheaply.
+- There is no edit workload at all, so the biggest per-edit token lever
+  (`patch` vs re-`push`, plan T4) cannot be measured.
+
+## 2. Scope
+
+### In
+1. **`judge.py`** — verdict for a produced artifact against a task:
+   - **Rules gate** (pure stdlib, offline, the cheap pre-filter):
+     - *terminal*: a rendered artifact was captured, non-empty, not an error/usage
+       string, contains box-drawing or ASCII structure, exit 0.
+     - *viewer*: captured spec parses as JSON, matches the task `type`'s minimal
+       shape (e.g. `flow` → `{nodes[], edges[]}`; `component` → a `{type,...}` tree;
+       `vegalite` → has `$schema`/`mark`/`encoding`; `status` → ≥1 status event),
+       and is within the `MAX_*` limits (nodes/edges/body/depth).
+   - **LLM rubric judge** (only for artifacts passing rules): build a strict prompt
+     (task prompt + `rubric` + artifact, truncated) → injected `call_fn` → parse a
+     strict-JSON verdict `{rubric_pass: bool, fidelity: 0..100, reason: str}`.
+     `call_fn` is injected so the rules path and parsing are unit-testable with a
+     mock; the real impl posts to the proxy `judge-model` alias.
+   - `evaluate(task, artifact, *, call_fn) -> Verdict`: rules first; judge only if
+     rules pass; `success = rules_pass and rubric_pass`.
+2. **`capture.py`** — `load_artifact(out_dir, task, *, phase) -> Artifact|None`:
+   terminal reads `$OUT/*.txt`/`*.mmd` then falls back to the runner transcript;
+   viewer reads `$OUT/pulled[.<phase>].json` (written by the entrypoint `pull`).
+3. **Judge model** — `judge-model` alias in `litellm.config.yaml` →
+   `os.environ/EXPERIMENT_JUDGE_MODEL` (defaulting to the stronger model, e.g.
+   `vertex_ai/gemini-2.5-pro`), matched before the `"*"` wildcard. `config.py`
+   gains `JUDGE_MODEL`/`EXPERIMENT_JUDGE_MODEL`. The judge runs **after** the runner's
+   spend window (the entrypoint records `SPEND_END` at runner finish), so judge tokens
+   are excluded from diagram cost; judge cost is recorded separately as `judge_tokens`.
+4. **`termchart` shim** — `podman/tc-shim.sh` installed first on PATH in the cell;
+   appends `argv` to `$OUT/tc-calls.log` then execs the real CLI. Yields
+   `termchart_calls`, push/patch/lint/render counts, and `edit_via_patch`.
+5. **Edit flow** — `Task` gains optional `edit: {instruction, rubric}`. When present
+   the cell runs **two turns** in the same workdir/board: gen, then edit. Records
+   per-phase tokens/latency; `edit_via_patch` from the shim log; post-edit judge.
+6. **Wiring** — extend `RunRecord` (below); `summarize`/report.md gain a fidelity
+   column and an edits section; dry-run `synthesize` produces fidelity + edit fields
+   so the offline pipeline and tests still exercise everything; `cell_record.py`
+   and `agent_run.invoke_runner` both call the shared `judge`/`capture` core.
+7. **Tasks** — add an `edit` to ≥3 pilot tasks (flow/component/status) and ensure
+   rubrics are judge-ready.
+8. **Tests** — judge rules on fixtures (good / empty / wrong-type / oversized /
+   error-text); judge parsing with a mock `call_fn` (valid, malformed, refusal);
+   shim-log parser; edit-cell record shape; dry-run report still green; existing 20
+   tests stay green.
+
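The rules-then-judge ordering in item 1 can be sketched as below. This is a minimal illustration under stated assumptions, not the harness's `judge.py`: the `Task` and `Verdict` shapes, the `MAX_NODES`/`MAX_EDGES` values, the 4000-character truncation, and the helper names `flow_rules` and `parse_verdict` are all hypothetical, and only the viewer `flow` shape is checked.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical limits: the spec names MAX_* but gives no values.
MAX_NODES = 200
MAX_EDGES = 400


@dataclass
class Task:  # hypothetical minimal task shape
    prompt: str
    rubric: str


@dataclass
class Verdict:
    rules_pass: bool
    rubric_pass: bool = False
    fidelity: int = 0
    reason: str = ""

    @property
    def success(self) -> bool:
        return self.rules_pass and self.rubric_pass


def flow_rules(raw: str) -> tuple[bool, str]:
    """Deterministic, offline gate for a viewer `flow` artifact."""
    try:
        spec = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not JSON"
    if not isinstance(spec, dict):
        return False, "not an object"
    nodes, edges = spec.get("nodes"), spec.get("edges")
    if not isinstance(nodes, list) or not isinstance(edges, list):
        return False, "missing nodes[]/edges[]"
    if len(nodes) > MAX_NODES or len(edges) > MAX_EDGES:
        return False, "over MAX_* limits"
    return True, "ok"


def parse_verdict(reply: str) -> tuple[bool, int, str]:
    """Parse the judge's strict-JSON reply; anything off-schema counts as a fail."""
    try:
        data = json.loads(reply)
        rubric_pass, fidelity, reason = data["rubric_pass"], data["fidelity"], data["reason"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False, 0, "unparseable judge reply"
    fidelity_ok = (
        isinstance(fidelity, int) and not isinstance(fidelity, bool) and 0 <= fidelity <= 100
    )
    if not isinstance(rubric_pass, bool) or not fidelity_ok:
        return False, 0, "judge reply out of schema"
    return rubric_pass, fidelity, str(reason)


def evaluate(task: Task, artifact: str, *, call_fn: Callable[[str], str]) -> Verdict:
    ok, why = flow_rules(artifact)
    if not ok:
        return Verdict(rules_pass=False, reason=why)  # the judge is never called
    judge_prompt = (
        f"TASK:\n{task.prompt}\n\nRUBRIC:\n{task.rubric}\n\nARTIFACT:\n{artifact[:4000]}"
    )
    rubric_pass, fidelity, reason = parse_verdict(call_fn(judge_prompt))
    return Verdict(True, rubric_pass, fidelity, reason)
```

Because `call_fn` is injected, a test can pass a lambda that returns a canned JSON string (or a refusal) and exercise both the rules path and the parser without any network access.
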
+### Deferred (explicitly out)
+- Visual/screenshot fidelity (gemini-ux-study) — judge the structured spec for now.
+- AGY runner (still blocked upstream).
+- The remaining T*/L* fixes (separate PRs, gated on this judge producing signal).
+- GCS results store at scale.
+
+## 3. `RunRecord` additions
+
+```python
+fidelity: int = 0           # 0..100, LLM judge (first-gen / final artifact)
+rubric_pass: bool = False   # judge boolean on the rubric
+rules_pass: bool = False    # deterministic gate
+judge_model: str = ""
+judge_tokens: int = 0       # judge cost, NOT counted in diagram cost
+# edit phase (only when task.edit present)
+is_edit: bool = False
+edit_tokens_total: int = 0
+edit_latency_ms: int = 0
+edit_via_patch: bool = False
+post_edit_fidelity: int = 0
+post_edit_success: bool = False
+```
+
+`success` becomes `rules_pass and rubric_pass` (judge disabled → `rules_pass`
+only, with `outcome="rules-only"`). `summarize()` already restricts cost metrics to
+`success` records, so this immediately makes existing stats honest. The new report
+metric `fidelity` is summarized as a median (higher = better; excluded from
+`LOWER_IS_BETTER`).
+
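The `success` rule and the median summary described above might look like the following. This is a sketch only: the function names, the dict-shaped records, and the `"judged"` outcome label are assumptions (the spec names only `"rules-only"`), and which records feed the fidelity median is left to the caller.

```python
from __future__ import annotations

from statistics import median
from typing import Optional


def derive_success(
    rules_pass: bool, rubric_pass: bool, *, judge_enabled: bool
) -> tuple[bool, str]:
    """Return (success, outcome) following the rule in this section."""
    if not judge_enabled:
        # Judge disabled: fall back to the deterministic gate alone.
        return rules_pass, "rules-only"
    # "judged" is a hypothetical label for the normal path.
    return rules_pass and rubric_pass, "judged"


def median_fidelity(records: list[dict]) -> Optional[float]:
    """Median of the `fidelity` field; None when there is nothing to summarize."""
    scores = [r["fidelity"] for r in records]
    return median(scores) if scores else None
```

A median keeps one very low or very high judge score from dominating the column, which matters when each condition has only a handful of cells.
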
+## 4. Data flow (podman cell, edit task)
+
+```
+entrypoint: build cond → stage $AW → install tc-shim on PATH
+  turn1: run runner(prompt)         → tc-calls.log, runner.json, *.txt
+         pull --json → pulled.gen.json ;  SPEND_MID = wc -l spend
+  turn2: run runner(edit.instruction)→ appends tc-calls.log
+         pull --json → pulled.json    ;  SPEND_END = wc -l spend
+cell_record:
+  gen cost   = slice[SPEND_START:SPEND_MID]
+  edit cost  = slice[SPEND_MID:SPEND_END]
+  artifact   = capture.load_artifact($OUT, task, phase=final)
+  verdict    = judge.evaluate(task, artifact, call_fn=proxy(judge-model))
+  edit_via_patch = "patch" in tc-calls.log after turn1
+  judge cost = slice[SPEND_END:EOF]   (kept out of diagram cost)
+  → RunRecord → cells/<id>.json + agent.jsonl
+```
+
+First-gen-only tasks skip turn2 / edit fields (back-compatible).
+
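The spend slicing and patch detection in the `cell_record` step can be illustrated as follows. Several details here are assumptions rather than facts about the harness: that the spend log is JSON lines with a `total_tokens` field, that `tc-calls.log` holds one whitespace-separated invocation per line, and that the number of shim log lines at the end of turn 1 (`calls_after_turn1`) is recorded somewhere.

```python
from __future__ import annotations

import json


def tokens_in(lines: list[str]) -> int:
    """Sum tokens over a slice of the spend log (assumed JSONL with `total_tokens`)."""
    return sum(json.loads(line).get("total_tokens", 0) for line in lines if line.strip())


def split_costs(spend_lines: list[str], start: int, mid: int, end: int) -> dict[str, int]:
    """Bound each phase by the line offsets the entrypoint recorded."""
    return {
        "gen": tokens_in(spend_lines[start:mid]),   # SPEND_START..SPEND_MID
        "edit": tokens_in(spend_lines[mid:end]),    # SPEND_MID..SPEND_END
        "judge": tokens_in(spend_lines[end:]),      # SPEND_END..EOF, kept out of diagram cost
    }


def edit_used_patch(call_lines: list[str], calls_after_turn1: int) -> bool:
    """True if any shim-logged invocation after turn 1 mentions `patch`."""
    return any("patch" in line.split() for line in call_lines[calls_after_turn1:])
```

With these slices, diagram cost is `gen + edit`, and the judge's tokens can be reported on their own line without changing any existing cost metric.
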
+## 5. Risks
+
+- **Weak-judge bias** → separate stronger `judge-model`; rules gate catches gross
+  failures deterministically regardless.
+- **Judge cost leaking into diagram cost** → bounded spend slices; judge runs last.
+- **Patch detection via shim is coarse** (counts a `patch` subcommand, not semantic
+  correctness of the patch) → documented; post-edit judge still grades the result.
+- **Two-turn state** depends on the board/file persisting between turns (same `$AW`,
+  same viewer board) → asserted in the entrypoint; terminal edits re-render (no patch).
+- **Offline testability** → all LLM calls injected/mocked; rules + dry-run cover CI.
+```