Experiment: latency/token reduction for diagram generation — pilot findings #151

## Experiment: reduce termchart diagram-generation latency + tokens

This pilot measures the end-to-end cost (tokens and latency) for a coding agent to generate a termchart diagram across multiple runners, and tests whether candidate fixes reduce it. Full plan: `docs/plans/2026-06-07-latency-token-experimentation-plan.md` (branch `perf/reduce-latency`, PR #143).

### Setup (what actually ran)
- **Substrate:** local **podman** containers (one long-lived LiteLLM proxy container + one isolated container per cell).
- **Measurement spine:** every runner is pointed at the **same** LiteLLM proxy, so all runners share one model, tokens and latency are captured uniformly (provider-neutral), and each cell's usage is recovered from its slice of the proxy spend log.
- **Backend model (shared, held constant):** **Gemini 2.5 Flash on Vertex** (`vertex_ai/gemini-2.5-flash`) via ADC.
  - Note: Claude on Vertex is **not enabled** in the project (404 in all regions; needs the Anthropic Model Garden EULA accepted in the console). Switching to Claude is a one-line `EXPERIMENT_MODEL` flip once enabled.
- **Conditions:** `baseline` = `master`; `c1` = `master` + the two fixes below.
- **Tasks:** `terminal-er-orders`, `component-plan-comparison`, `flow-auth-callgraph` (from `scripts/experiments/tasks/pilot.jsonl`).
- **Matrix:** 2 runners × 2 conditions × 3 tasks × 2 reps = **24 cells, 100% success**.
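
The matrix arithmetic above can be checked by enumerating the cells directly. This is a minimal sketch, not the harness's own code: the `enumerate_cells` helper and the dictionary keys are hypothetical, while the runner, condition, and task names are taken from the setup list.

```python
from itertools import product

RUNNERS = ["claude-code", "opencode"]
CONDITIONS = ["baseline", "c1"]
TASKS = ["terminal-er-orders", "component-plan-comparison", "flow-auth-callgraph"]
REPS = 2


def enumerate_cells(runners, conditions, tasks, reps):
    """Return one dict per cell: every runner x condition x task x rep combination."""
    return [
        {"runner": r, "condition": c, "task": t, "rep": i}
        for r, c, t, i in product(runners, conditions, tasks, range(1, reps + 1))
    ]


cells = enumerate_cells(RUNNERS, CONDITIONS, TASKS, REPS)
print(len(cells))  # 2 * 2 * 3 * 2 = 24
```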

Each cell: the container builds termchart at the condition ref and installs the CLI, then runs the runner **headless** (as non-root `node`); the agent uses termchart to render a diagram, and a `RunRecord` is emitted (tokens from the proxy, wall-clock latency, success gate = clean exit + ≥1 model call).
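As an illustration of the per-cell record and success gate described above, here is a minimal sketch. The field names, the `build_record` helper, and the spend-row keys (`ts`, `prompt_tokens`, `completion_tokens`) are assumptions made for the example. They are not the harness's actual `RunRecord` schema or LiteLLM's spend-log format.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical shape; the real RunRecord fields are not shown in this issue."""
    cell_id: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    exit_code: int
    model_calls: int

    @property
    def success(self) -> bool:
        # Success gate from the text: clean exit + at least one model call.
        return self.exit_code == 0 and self.model_calls >= 1


def build_record(cell_id: str, exit_code: int, started_at: float, ended_at: float,
                 spend_rows: Iterable[Mapping]) -> RunRecord:
    # Keep only spend-log rows whose timestamp falls inside this cell's run window.
    rows = [r for r in spend_rows if started_at <= r["ts"] <= ended_at]
    return RunRecord(
        cell_id=cell_id,
        input_tokens=sum(r["prompt_tokens"] for r in rows),
        output_tokens=sum(r["completion_tokens"] for r in rows),
        latency_s=ended_at - started_at,
        exit_code=exit_code,
        model_calls=len(rows),
    )


# Illustrative rows only, not measurements from the pilot.
example_rows = [
    {"ts": 5.0, "prompt_tokens": 100, "completion_tokens": 10},
    {"ts": 12.0, "prompt_tokens": 200, "completion_tokens": 20},
    {"ts": 30.0, "prompt_tokens": 300, "completion_tokens": 30},
]
record = build_record("example-cell", 0, 10.0, 20.0, example_rows)
print(record.success, record.input_tokens)
```

Slicing by time window assumes cells do not overlap in time on the shared proxy; a per-cell request tag would be an alternative if they did.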

### Headline finding — runner choice dominates cost
| condition | runner | median input tokens | median latency |
|---|---|---|---|
| baseline | claude-code | 29,510 | 15.2s |
| baseline | opencode | 8,522 | 3.5s |
| c1 | claude-code | 38,986 | 14.8s |
| c1 | opencode | 8,742 | 4.4s |

**OpenCode uses ~3.5–4.5× fewer input tokens than Claude Code** for identical diagram tasks (29,510 vs 8,522 at baseline; 38,986 vs 8,742 under c1): smaller system prompt + far less repo exploration. The runner is a bigger lever than the termchart-side fixes tried so far.
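The runner ratio can be recomputed from the medians in the table above. The sketch below only restates that arithmetic; the `runner_ratio` helper is a hypothetical name.

```python
# Median input tokens per (condition, runner), copied from the headline table.
MEDIAN_INPUT_TOKENS = {
    ("baseline", "claude-code"): 29_510,
    ("baseline", "opencode"): 8_522,
    ("c1", "claude-code"): 38_986,
    ("c1", "opencode"): 8_742,
}


def runner_ratio(condition: str) -> float:
    """Claude Code median input tokens divided by OpenCode's, for one condition."""
    return (MEDIAN_INPUT_TOKENS[(condition, "claude-code")]
            / MEDIAN_INPUT_TOKENS[(condition, "opencode")])


for condition in ("baseline", "c1"):
    print(f"{condition}: {runner_ratio(condition):.2f}x")
# baseline: 3.46x
# c1: 4.46x
```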

### Fix effect (c1 vs baseline)
| metric | Δ | significant? |
|---|---|---|
| median tokens | **−1.0%** | no (CIs overlap) |
| median latency | +62.8% | no (small-N noise, heavy tails) |

c1 = **T1** (minify diagram example JSONs, −45% bytes) + **T5** (correct stale `AGENTS.md` exit code). This null result is **expected**: T1 only saves tokens when an agent actually *reads* a large example, and T5 is a one-character correctness fix, so neither materially affects these tasks. The pilot's value is a working, reproducible measurement apparatus plus baselines.
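The issue does not say how the confidence intervals behind "CIs overlap" were computed. One common choice for medians at small N is a percentile bootstrap, sketched below as an assumption rather than as the pilot's actual method. The function names are hypothetical and the sample values are made-up placeholders, not pilot measurements.

```python
import random
from statistics import median


def bootstrap_median_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    medians = sorted(median(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return medians[lo_idx], medians[hi_idx]


def intervals_overlap(a, b):
    """True when closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Placeholder samples for illustration only.
baseline_sample = [10, 12, 11, 50, 9, 13]
c1_sample = [11, 12, 10, 48, 9, 14]
ci_baseline = bootstrap_median_ci(baseline_sample)
ci_c1 = bootstrap_median_ci(c1_sample)
print(ci_baseline, ci_c1, intervals_overlap(ci_baseline, ci_c1))
```

With only a handful of cells per condition and heavy-tailed latencies, intervals like these tend to be wide, which is consistent with the "small-N noise" note in the table.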

### Runner status
| Runner | Headless | Routed via proxy | Status |
|---|---|---|---|
| Claude Code | `claude -p … --permission-mode bypassPermissions` (runs as non-root `node`) | `ANTHROPIC_BASE_URL` → LiteLLM | ✅ working |
| OpenCode | `opencode run … --model openai/shared-model` | openai provider `baseURL` → LiteLLM | ✅ working |
| AGY (Antigravity) | `agy -p … --dangerously-skip-permissions` | — | ⛔ deferred — `agy` 1.0.6 exists but has **no custom base-URL flag**, so it can't share the proxy/model or be token-measured uniformly; also needs `ANTIGRAVITY_API_KEY` |

### Fixes submitted (PRs against master, not yet merged)
- #147 — T5: correct `AGENTS.md` `push`/`status` exit code (4, not 3)
- #148 — T1: minify diagram example JSONs (~45% smaller; 305 KB → 168 KB)
- #143 — c1 integration branch + P0 harness + plan (draft)
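
For context on T1, the sketch below shows one way JSON can be minified with the Python standard library and how the quoted ~45% figure follows from the 305 KB → 168 KB sizes. It is an illustration under assumptions, not the script used in #148; `minify_json` and `bytes_saved_fraction` are hypothetical helper names.

```python
import json


def minify_json(text: str) -> str:
    """Re-serialize JSON without indentation or spaces after separators."""
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)


def bytes_saved_fraction(original: str, minified: str) -> float:
    """Fraction of UTF-8 bytes removed by minification."""
    before = len(original.encode("utf-8"))
    after = len(minified.encode("utf-8"))
    return 1 - after / before


pretty = json.dumps({"a": [1, 2], "b": {"c": "x"}}, indent=2)
small = minify_json(pretty)
print(small)  # {"a":[1,2],"b":{"c":"x"}}

# Sizes quoted for #148: 305 KB before, 168 KB after.
print(round(1 - 168 / 305, 3))  # 0.449, i.e. ~45% smaller
```

Minification changes bytes on disk, so it only reduces tokens in runs where the agent actually reads one of the example files.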

### Conclusions
1. The experiment apparatus works end-to-end on local podman with real Vertex Gemini, across two runners.
2. The fixes tried so far (T1, T5) are good hygiene but **do not move tokens** on these tasks.
3. The dominant cost driver is the **runner's own overhead** (system prompt + exploration), with Claude Code ≫ OpenCode.

### Next
To get a real token delta, target the **always-loaded path**:
- **T2** — slim `AGENTS.md` / `SKILL.md` (cut what every session loads).
- **T3** — tighter routing so the agent reads exactly one detail file + one example.
- Reduce agent repo-exploration (run tasks in a neutral workspace with termchart installed, not inside the termchart source tree).
- Add a matrix-heavy task (RACI/risk) so T1's minification is actually exercised.
- Flip the shared model to Claude once Anthropic Model Garden is enabled, and scale reps for tighter CIs.

_Reproduce:_ `RUNNERS=claude-code,opencode CONDITIONS=baseline,c1 TASKS=… REPS=2 scripts/experiments/podman/run_local.sh`

