2 changes: 1 addition & 1 deletion AGENTS.md
@@ -39,7 +39,7 @@ Supported subset: `graph TD|LR` nodes `A[x]` `B{decision}` `C[(db)]` edges `-->`
Push native JSON to a running viewer — run it locally
(`PUSH_TOKEN=dev PORT=8080 node packages/viewer/dist/server.js`, open `http://localhost:8080/w/me/`;
the viewer ships in this repo, not the npm CLI) or point at a deployed one. Configured by two
- env vars — **if either is unset, `push`/`status` exit 3** with `Set TERMCHART_VIEWER_URL and TERMCHART_VIEWER_TOKEN.`:
+ env vars — **if either is unset, `push`/`status` exit 4** with `…are not set: no termchart viewer configured.`:

```bash
export TERMCHART_VIEWER_URL="http://localhost:8080/w/me"
```
288 changes: 288 additions & 0 deletions docs/plans/2026-06-07-latency-token-experimentation-plan.md

Large diffs are not rendered by default.

122 changes: 122 additions & 0 deletions docs/superpowers/specs/2026-06-09-bench-judge-edit-flow-design.md
@@ -0,0 +1,122 @@
# Bench harness P1: correctness/fidelity judge + edit flow

**Branch:** `perf/bench-p1` (stacked on `perf/reduce-latency` / PR #143) · **Date:** 2026-06-09 · **Status:** draft

Extends the latency/token experiment harness (`scripts/experiments/`, plan
`docs/plans/2026-06-07-latency-token-experimentation-plan.md`) with the two pieces
P1 explicitly defers: a **real correctness gate + fidelity judge**, and an
**edit (two-turn) flow**. These are the "output fidelity/correctness/usability"
and "first-generation **and** edits" axes of the climb.

## 1. Why

Today a cell counts as **success** iff the runner exited 0 and made ≥1 LLM call
(`cell_record.py:53`). Task `rubric`s are loaded but unused. Consequences:
- Token/latency deltas are untrustworthy — a fix can "win" by emitting junk cheaply.
- There is no edit workload at all, so the biggest per-edit token lever
(`patch` vs re-`push`, plan T4) cannot be measured.

## 2. Scope

### In
1. **`judge.py`** — verdict for a produced artifact against a task:
   - **Rules gate** (pure stdlib, offline, the cheap pre-filter):
     - *terminal*: a rendered artifact was captured, non-empty, not an error/usage
       string, contains box-drawing or ASCII structure, exit 0.
     - *viewer*: captured spec parses as JSON, matches the task `type`'s minimal
       shape (e.g. `flow` → `{nodes[], edges[]}`; `component` → a `{type,...}` tree;
       `vegalite` → has `$schema`/`mark`/`encoding`; `status` → ≥1 status event),
       and is within the `MAX_*` limits (nodes/edges/body/depth).
   - **LLM rubric judge** (only for artifacts passing rules): build a strict prompt
     (task prompt + `rubric` + artifact, truncated) → injected `call_fn` → parse a
     strict-JSON verdict `{rubric_pass: bool, fidelity: 0..100, reason: str}`.
     `call_fn` is injected so the rules path and parsing are unit-testable with a
     mock; the real impl posts to the proxy `judge-model` alias.
   - `evaluate(task, artifact, *, call_fn) -> Verdict`: rules first; judge only if
     rules pass; `success = rules_pass and rubric_pass`.
2. **`capture.py`** — `load_artifact(out_dir, task, *, phase) -> Artifact|None`:
terminal reads `$OUT/*.txt`/`*.mmd` then falls back to the runner transcript;
viewer reads `$OUT/pulled[.<phase>].json` (written by the entrypoint `pull`).
3. **Judge model** — `judge-model` alias in `litellm.config.yaml` →
`os.environ/EXPERIMENT_JUDGE_MODEL` (defaults to the stronger model, e.g.
`vertex_ai/gemini-2.5-pro`), matched before the `"*"` wildcard. `config.py`
gains `JUDGE_MODEL`/`EXPERIMENT_JUDGE_MODEL`. The judge runs **after** the runner's
spend window (the entrypoint records `SPEND_END` when the runner finishes), so judge
tokens are excluded from diagram cost; judge cost is recorded separately as `judge_tokens`.
4. **`termchart` shim** — `podman/tc-shim.sh` installed first on PATH in the cell;
appends `argv` to `$OUT/tc-calls.log` then execs the real CLI. Yields
`termchart_calls`, push/patch/lint/render counts, and `edit_via_patch`.
5. **Edit flow** — `Task` gains optional `edit: {instruction, rubric}`. When present
the cell runs **two turns** in the same workdir/board: gen, then edit. Records
per-phase tokens/latency; `edit_via_patch` from the shim log; post-edit judge.
6. **Wiring** — extend `RunRecord` (below); `summarize`/report.md gain a fidelity
column and an edits section; dry-run `synthesize` produces fidelity + edit fields
so the offline pipeline and tests still exercise everything; `cell_record.py`
and `agent_run.invoke_runner` both call the shared `judge`/`capture` core.
7. **Tasks** — add an `edit` to ≥3 pilot tasks (flow/component/status) and ensure
rubrics are judge-ready.
8. **Tests** — judge rules on fixtures (good / empty / wrong-type / oversized /
error-text); judge parsing with a mock `call_fn` (valid, malformed, refusal);
shim-log parser; edit-cell record shape; dry-run report still green; existing 20
tests stay green.
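
To make the `judge.py` contract from item 1 concrete, here is a minimal sketch of the rules-first flow for a viewer `flow` artifact. It assumes tasks are plain dicts with `prompt` and `rubric` keys and that artifacts are raw strings. The limit values, `MAX_ARTIFACT_CHARS`, `rules_gate_flow`, and `parse_verdict` are hypothetical names chosen for illustration, not the repo's actual API.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical values; the real MAX_* limits are defined elsewhere in the repo.
MAX_NODES = 200
MAX_EDGES = 400
MAX_ARTIFACT_CHARS = 8000


@dataclass
class Verdict:
    rules_pass: bool
    rubric_pass: bool = False
    fidelity: int = 0
    reason: str = ""

    @property
    def success(self) -> bool:
        return self.rules_pass and self.rubric_pass


def rules_gate_flow(raw: str) -> Tuple[bool, str]:
    """Deterministic gate for a `flow` spec: parses as JSON, has the shape, fits the limits."""
    try:
        spec = json.loads(raw)
    except (TypeError, ValueError):
        return False, "not valid JSON"
    if not isinstance(spec, dict):
        return False, "spec is not an object"
    nodes, edges = spec.get("nodes"), spec.get("edges")
    if not isinstance(nodes, list) or not isinstance(edges, list):
        return False, "missing nodes[] or edges[]"
    if len(nodes) > MAX_NODES or len(edges) > MAX_EDGES:
        return False, "over size limit"
    return True, "ok"


def parse_verdict(reply: str) -> Optional[Tuple[bool, int, str]]:
    """Parse the judge's strict-JSON reply; None for malformed output or a refusal."""
    try:
        data = json.loads(reply)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    rubric_pass, fidelity = data.get("rubric_pass"), data.get("fidelity")
    if not isinstance(rubric_pass, bool):
        return None
    # bool is a subclass of int, so reject it explicitly for the score.
    if isinstance(fidelity, bool) or not isinstance(fidelity, int):
        return None
    if not 0 <= fidelity <= 100:
        return None
    return rubric_pass, fidelity, str(data.get("reason", ""))


def evaluate(task: dict, artifact: str, *, call_fn: Callable[[str], str]) -> Verdict:
    """Rules first; the judge is only called for artifacts that pass the gate."""
    ok, why = rules_gate_flow(artifact)
    if not ok:
        return Verdict(rules_pass=False, reason=why)
    prompt = (
        "Reply with JSON only: "
        '{"rubric_pass": bool, "fidelity": 0..100, "reason": str}\n'
        f"TASK: {task['prompt']}\nRUBRIC: {task['rubric']}\n"
        f"ARTIFACT: {artifact[:MAX_ARTIFACT_CHARS]}"
    )
    parsed = parse_verdict(call_fn(prompt))
    if parsed is None:
        return Verdict(rules_pass=True, reason="unparseable judge reply")
    rubric_pass, fidelity, reason = parsed
    return Verdict(True, rubric_pass, fidelity, reason)
```

Because `call_fn` is just a callable from prompt to reply, a unit test can pass a lambda returning a canned JSON string, and a test for the rules path can assert that the callable was never invoked.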

### Deferred (explicitly out)
- Visual/screenshot fidelity (gemini-ux-study) — judge the structured spec for now.
- AGY runner (still blocked upstream).
- The remaining T*/L* fixes (separate PRs, gated on this judge producing signal).
- GCS results store at scale.

## 3. `RunRecord` additions

```
fidelity: int = 0 # 0..100, LLM judge (first-gen / final artifact)
rubric_pass: bool = False # judge boolean on the rubric
rules_pass: bool = False # deterministic gate
judge_model: str = ""
judge_tokens: int = 0 # judge cost, NOT counted in diagram cost
# edit phase (only when task.edit present)
is_edit: bool = False
edit_tokens_total: int = 0
edit_latency_ms: int = 0
edit_via_patch: bool = False
post_edit_fidelity: int = 0
post_edit_success: bool = False
```

`success` becomes `rules_pass and rubric_pass` (judge disabled → `rules_pass`
only, with `outcome="rules-only"`). `summarize()` already restricts cost metrics to
`success` records, so this immediately makes existing stats honest. New report
metric `fidelity` is summarized as median (higher = better; excluded from
`LOWER_IS_BETTER`).
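
The success rule and the fidelity summary described above can be sketched as follows. `derive_success` and `median_fidelity` are hypothetical helper names, and the sketch leaves the choice of which records feed the median to the caller, since the spec does not pin that down.

```python
from statistics import median
from typing import Iterable, Optional, Tuple


def derive_success(
    rules_pass: bool, rubric_pass: bool, *, judge_enabled: bool
) -> Tuple[bool, Optional[str]]:
    """Return (success, outcome override) following the rule in this section."""
    if not judge_enabled:
        # Judge disabled: the deterministic gate alone decides, and the record is flagged.
        return rules_pass, "rules-only"
    # Other outcome labels are not specified here, so no override is returned.
    return rules_pass and rubric_pass, None


def median_fidelity(fidelities: Iterable[int]) -> Optional[float]:
    """Median fidelity (higher is better); None when there is nothing to summarize."""
    values = list(fidelities)
    return median(values) if values else None
```

Returning `None` for an empty input keeps a condition with no judged cells from showing up as a misleading fidelity of 0.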

## 4. Data flow (podman cell, edit task)

```
entrypoint: build cond → stage $AW → install tc-shim on PATH
  turn1: run runner(prompt)            → tc-calls.log, runner.json, *.txt
         pull --json → pulled.gen.json ; SPEND_MID = wc -l spend
  turn2: run runner(edit.instruction)  → appends tc-calls.log
         pull --json → pulled.json     ; SPEND_END = wc -l spend
cell_record:
  gen cost       = slice[SPEND_START:SPEND_MID]
  edit cost      = slice[SPEND_MID:SPEND_END]
  artifact       = capture.load_artifact($OUT, task, phase=final)
  verdict        = judge.evaluate(task, artifact, call_fn=proxy(judge-model))
  edit_via_patch = "patch" in tc-calls.log after turn1
  judge cost     = slice[SPEND_END:EOF]   (kept out of diagram cost)
  → RunRecord → cells/<id>.json + agent.jsonl
```

First-gen-only tasks skip turn2 / edit fields (back-compatible).
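
The spend slicing in the flow above amounts to cutting one append-only log at three recorded line counts. A minimal sketch, assuming the spend log is JSONL with a `total_tokens` field per entry (both the format and the field name are assumptions, as are `tokens_in`, `split_spend`, and the `gen_tokens` key):

```python
import json
from typing import Dict, List


def tokens_in(lines: List[str]) -> int:
    """Sum the assumed `total_tokens` field over JSONL spend entries."""
    return sum(json.loads(line).get("total_tokens", 0) for line in lines if line.strip())


def split_spend(lines: List[str], start: int, mid: int, end: int) -> Dict[str, int]:
    """Attribute spend to phases using the line counts recorded by the entrypoint."""
    return {
        "gen_tokens": tokens_in(lines[start:mid]),        # slice[SPEND_START:SPEND_MID]
        "edit_tokens_total": tokens_in(lines[mid:end]),   # slice[SPEND_MID:SPEND_END]
        "judge_tokens": tokens_in(lines[end:]),           # slice[SPEND_END:EOF]
    }
```

Since the slices are half-open and share boundaries, every spend entry after `SPEND_START` lands in exactly one phase, which is what keeps judge tokens out of diagram cost.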

## 5. Risks

- **Weak-judge bias** → separate stronger `judge-model`; rules gate catches gross
failures deterministically regardless.
- **Judge cost leaking into diagram cost** → bounded spend slices; judge runs last.
- **Patch detection via shim is coarse** (counts a `patch` subcommand, not semantic
correctness of the patch) → documented; post-edit judge still grades the result.
- **Two-turn state** depends on the board/file persisting between turns (same `$AW`,
same viewer board) → asserted in the entrypoint; terminal edits re-render (no patch).
- **Offline testability** → all LLM calls injected/mocked; rules + dry-run cover CI.
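
The coarse patch detection named in the risks can be illustrated with a small shim-log parser. This sketch assumes `tc-calls.log` holds one line per invocation with the subcommand as the first whitespace-separated token, and that the caller knows how many calls had been logged when turn 1 finished. `parse_shim_log`, `turn1_calls`, and the `*_count` keys are hypothetical.

```python
from collections import Counter
from typing import Dict, Union

TRACKED = ("push", "patch", "lint", "render")


def parse_shim_log(text: str, turn1_calls: int) -> Dict[str, Union[int, bool]]:
    """Count termchart invocations and flag whether the edit turn used `patch`."""
    calls = [line.split() for line in text.splitlines() if line.strip()]
    subcommands = [argv[0] for argv in calls]
    counts = Counter(s for s in subcommands if s in TRACKED)
    result: Dict[str, Union[int, bool]] = {"termchart_calls": len(calls)}
    for name in TRACKED:
        result[f"{name}_count"] = counts[name]
    # Coarse by design: only checks that a `patch` subcommand ran after turn 1,
    # not that the patch was semantically correct.
    result["edit_via_patch"] = "patch" in subcommands[turn1_calls:]
    return result
```

A `patch` issued during the first turn does not set `edit_via_patch`, because only calls logged after the turn-1 boundary are inspected; whether the edit was actually right is left to the post-edit judge.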