Experiment: latency/token reduction for diagram generation — pilot findings #151

@ivanmkc

Experiment: reduce termchart diagram-generation latency + tokens

This experiment measures the end-to-end cost (latency and tokens) for a coding agent to generate a termchart diagram across multiple runners, and tests whether candidate fixes reduce it. Full plan: docs/plans/2026-06-07-latency-token-experimentation-plan.md (branch perf/reduce-latency, PR #143).

Setup (what actually ran)

  • Substrate: local podman containers (one long-lived LiteLLM proxy container + one isolated container per cell).
  • Measurement spine: every runner is pointed at the same LiteLLM proxy, so they share one model and tokens/latency are captured uniformly (provider-neutral) and correlated per cell by spend-log slice.
  • Backend model (shared, held constant): Gemini 2.5 Flash on Vertex (vertex_ai/gemini-2.5-flash) via ADC.
    • Note: Claude on Vertex is not enabled in the project (404 in all regions; needs the Anthropic Model Garden EULA accepted in the console). Switching to Claude is a one-line EXPERIMENT_MODEL flip once enabled.
  • Conditions: baseline = master; c1 = master + the two fixes below.
  • Tasks: terminal-er-orders, component-plan-comparison, flow-auth-callgraph (from scripts/experiments/tasks/pilot.jsonl).
  • Matrix: 2 runners × 2 conditions × 3 tasks × 2 reps = 24 cells, 100% success.
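The matrix arithmetic above can be checked with a short sketch. The runner, condition, and task names come from this issue; the `enumerate_cells` helper and the dict layout are hypothetical and are not taken from the experiment scripts.

```python
from itertools import product

RUNNERS = ["claude-code", "opencode"]
CONDITIONS = ["baseline", "c1"]
TASKS = ["terminal-er-orders", "component-plan-comparison", "flow-auth-callgraph"]
REPS = 2


def enumerate_cells(runners, conditions, tasks, reps):
    """Return one dict per cell: every runner x condition x task x rep combination.

    Hypothetical helper for illustration; the real harness may order or
    name cells differently.
    """
    return [
        {"runner": r, "condition": c, "task": t, "rep": rep}
        for r, c, t, rep in product(runners, conditions, tasks, range(1, reps + 1))
    ]


cells = enumerate_cells(RUNNERS, CONDITIONS, TASKS, REPS)
print(len(cells))  # 2 * 2 * 3 * 2 = 24
```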

Each cell runs the same steps: the container builds termchart at the condition ref and installs the CLI; the runner runs headless (as non-root node); the agent uses termchart to render a diagram; and a RunRecord is emitted with tokens from the proxy and wall-clock latency. The success gate is a clean exit plus at least one model call.
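A minimal sketch of how a per-cell record and its success gate might look. Only the name RunRecord and the gate (clean exit + ≥1 model call) come from this issue; every field name, the spend-log entry keys (`ts`, `prompt_tokens`, `completion_tokens`), and the time-window slicing are assumptions. In particular, slicing by time window only attributes calls correctly if cells run one at a time against the proxy.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    # All field names are hypothetical; the real RunRecord schema may differ.
    runner: str
    condition: str
    task: str
    rep: int
    input_tokens: int
    output_tokens: int
    model_calls: int
    latency_s: float
    exit_code: int

    @property
    def success(self) -> bool:
        # Success gate from the issue: clean exit and at least one model call.
        return self.exit_code == 0 and self.model_calls >= 1


def slice_spend_log(entries, start_ts, end_ts):
    """Keep the proxy log entries whose timestamp falls inside the cell's window."""
    return [e for e in entries if start_ts <= e["ts"] <= end_ts]


def build_record(cell, entries, start_ts, end_ts, exit_code):
    """Aggregate the cell's slice of the spend log into one RunRecord."""
    calls = slice_spend_log(entries, start_ts, end_ts)
    return RunRecord(
        runner=cell["runner"],
        condition=cell["condition"],
        task=cell["task"],
        rep=cell["rep"],
        input_tokens=sum(e["prompt_tokens"] for e in calls),
        output_tokens=sum(e["completion_tokens"] for e in calls),
        model_calls=len(calls),
        latency_s=end_ts - start_ts,
        exit_code=exit_code,
    )
```

A cell that exits cleanly without ever reaching the model would be recorded with `model_calls == 0` and fail the gate, which keeps silently misrouted runs out of the medians.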

Headline finding — runner choice dominates cost

| condition | runner | median input tokens | median latency |
|---|---|---|---|
| baseline | claude-code | 29,510 | 15.2s |
| baseline | opencode | 8,522 | 3.5s |
| c1 | claude-code | 38,986 | 14.8s |
| c1 | opencode | 8,742 | 4.4s |

OpenCode uses roughly 3.5× (baseline) to 4.5× (c1) fewer input tokens than Claude Code for identical diagram tasks, owing to a smaller system prompt and far less repo exploration. The runner is a bigger lever than the termchart-side fixes tried so far.
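The runner gap can be recomputed directly from the medians in the table above. The numbers are copied from this issue; the `MEDIANS` layout and the `runner_ratio` helper are illustrative only.

```python
# (condition, runner) -> (median input tokens, median latency in seconds),
# copied from the headline table.
MEDIANS = {
    ("baseline", "claude-code"): (29510, 15.2),
    ("baseline", "opencode"): (8522, 3.5),
    ("c1", "claude-code"): (38986, 14.8),
    ("c1", "opencode"): (8742, 4.4),
}

TOKENS, LATENCY = 0, 1


def runner_ratio(condition, metric):
    """Claude Code median divided by OpenCode median for one condition."""
    claude = MEDIANS[(condition, "claude-code")][metric]
    opencode = MEDIANS[(condition, "opencode")][metric]
    return claude / opencode


for condition in ("baseline", "c1"):
    print(
        condition,
        f"tokens x{runner_ratio(condition, TOKENS):.2f}",
        f"latency x{runner_ratio(condition, LATENCY):.2f}",
    )
```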

Fix effect (c1 vs baseline)

| metric | Δ | significant? |
|---|---|---|
| median tokens | −1.0% | no (CIs overlap) |
| median latency | +62.8% | no (small-N noise, heavy tails) |

c1 = T1 (minify diagram example JSONs, −45% bytes) + T5 (correct stale AGENTS.md exit code). This null result is expected and honest: T1 only saves tokens when an agent actually reads a large example, and T5 is a one-character correctness fix — neither materially affects these tasks. The pilot's value is a working, reproducible measurement apparatus + baselines.
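The issue does not say how the confidence intervals were computed. One common approach that tolerates heavy tails is a percentile bootstrap on the median, sketched below as an assumption rather than a description of the pilot's analysis code. The sample values are made-up placeholders, not pilot measurements.

```python
import random
from statistics import median


def bootstrap_median_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median of `samples`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    medians = sorted(
        median(rng.choices(samples, k=len(samples))) for _ in range(n_resamples)
    )
    lo = medians[int((alpha / 2) * n_resamples)]
    hi = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def relative_delta(baseline, treatment):
    """Relative change in the median from baseline to treatment."""
    return (median(treatment) - median(baseline)) / median(baseline)


# Placeholder latencies in seconds (NOT pilot data); one heavy-tail outlier each.
placeholder_baseline = [3.1, 3.4, 3.6, 3.9, 14.0, 3.5]
placeholder_c1 = [3.3, 4.6, 4.2, 3.8, 16.5, 4.4]

ci_base = bootstrap_median_ci(placeholder_baseline)
ci_c1 = bootstrap_median_ci(placeholder_c1)
print(relative_delta(placeholder_baseline, placeholder_c1), intervals_overlap(ci_base, ci_c1))
```

With only a handful of samples per cell, a single slow run can widen the interval enough to swallow a large point-estimate delta, which is consistent with the "small-N noise, heavy tails" note in the table.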

Runner status

| Runner | Headless | Routed via proxy | Status |
|---|---|---|---|
| Claude Code | claude -p … --permission-mode bypassPermissions (runs as non-root node) | ANTHROPIC_BASE_URL → LiteLLM | ✅ working |
| OpenCode | opencode run … --model openai/shared-model | openai provider baseURL → LiteLLM | ✅ working |
| AGY (Antigravity) | agy -p … --dangerously-skip-permissions | n/a | ⛔ deferred: agy 1.0.6 exists but has no custom base-URL flag, so it can't share the proxy/model or be token-measured uniformly; also needs ANTIGRAVITY_API_KEY |

Fixes shipped (PRs into master, not merged)

Conclusions

  1. The experiment apparatus works end-to-end on local podman with real Vertex Gemini, across two runners.
  2. The fixes tried so far (T1, T5) are good hygiene but do not move tokens on these tasks.
  3. The dominant cost driver is the runner's own overhead (system prompt + exploration), with Claude Code ≫ OpenCode.

Next

To get a real token delta, target the always-loaded path:

  • T2 — slim AGENTS.md / SKILL.md (cut what every session loads).
  • T3 — tighter routing so the agent reads exactly one detail file + one example.
  • Reduce agent repo-exploration (run tasks in a neutral workspace with termchart installed, not inside the termchart source tree).
  • Add a matrix-heavy task (RACI/risk) so T1's minification is actually exercised.
  • Flip the shared model to Claude once Anthropic Model Garden is enabled, and scale reps for tighter CIs.

Reproduce: RUNNERS=claude-code,opencode CONDITIONS=baseline,c1 TASKS=… REPS=2 scripts/experiments/podman/run_local.sh
