We are building the neutral measurement substrate for AI agents: the system that can answer, with execution-grounded evidence, "which (harness × model × provider × strategy) combination is actually best for task-class X" — cheaply, for any axis, on demand.
Everything else — a you.com comparison, a coding benchmark, a research leaderboard — is one cell of that matrix. The product is the measurer, not any single measurement.
Why this goal:
- Every buyer of AI infrastructure has this question and nobody neutral answers it. Vendors publish self-serving benchmarks; academic benchmarks go stale and don't cover the buyer's axis (their harness, their provider choice, their task-class). The neutral, current, per-axis measurer is the missing layer.
- Honesty is the moat. A measurer that reports "you tie here, you win there" is trustable; a marketing benchmark is not. A parity result is an asset for the substrate even when it disappoints a vendor — it's the proof we measure fairly. We never trade this away.
- It compounds with this repo's thesis. The substrate is the judge/corpus half of the learning flywheel (docs/learning-flywheel.md): the same execution-grounded verdicts that rank providers are the training signal for self-improving agents. One investment, two products.
The product is the RSI agent runtime (architecture.md is the spine): a literal agent that acts, dispatches agents, carries a policy, parallelizes through the combinators, and is analyzed and improved from its traces. That program is the point. This document describes the layer that program needs and sells:
- The eval substrate is the grounding + the sellable exhaust. Self-improvement needs external, execution-grounded verdicts (a model grading itself is self-delusion — the same circularity as a model validating its own generated tests). The same leaderboard that grounds the loops is the commercial artifact: it runs plain harnesses today → agent profiles next → the RSI loops eventually, on the same benchmarks — the public, longitudinal proof the runtime delivers lift.
- generate-eval is one skill, a software-3.0 leverage point we pick up along the way. It closes eval-scenario gaps by minting deployable checkers with correctable bands (the inputs the RSI gate has been starving for). It is an instance running on the runtime (a loopUntil with a deterministic certifier as its validator), not a pillar beside it.
Why the alternatives are worse:
- "Hand-author benchmarks per engagement" — doesn't scale, goes stale the day a library ships a new release, and the authoring bottleneck caps how many axes we can sell.
- "Use existing public benchmarks" — stale by construction (models train on them), rarely cover the buyer's axis, and saturate (SimpleQA-style ceilings) or floor (research hard-mode), so they stop discriminating exactly where decisions get made.
- "Optimize one vendor-pleasing number" — destroys the only durable asset (neutrality), and dies the first time the honest answer is "parity."
An agent authors eval tasks; the runtime certifies them. A generated task is admitted only if it survives two machine gates:
- Grounding gate — the reference solution actually executes and passes against the real, pinned, current target (a real library in a sandbox; a citable primary source for QA). The library/source is the oracle — never the generating model's belief. This kills the circularity trap where a model "validates" a test against its own stale memory.
- Discrimination gate — a no-search/parametric baseline must fail the task. A task every model already solves measures nothing; admit only tasks with a correctable band.
Inputs that auto-target freshness: recent release notes / changelogs (new API surface is search-dependent by construction); existing repos with existing test suites (highest validity — reuse, don't reinvent tests).
The loop: author → execute → (fix until the reference builds and passes) → run parametric baseline → admit/reject. Iteration is expected; "didn't compile after 1 rollout" is a normal intermediate state, not a failure.
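The two gates and the loop above can be sketched as a small certifier. This is a minimal illustration, not the runtime's actual code: `Candidate`, `Verdict`, `certify`, and the three callables (`run_reference`, `repair`, `baseline_solves`) are hypothetical names, and the stubs at the bottom stand in for the sandbox, the authoring agent, and the parametric baseline run.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    task_id: str
    reference: str  # source of the reference solution (hypothetical field)


@dataclass(frozen=True)
class Verdict:
    admitted: bool
    reason: str
    attempts: int


def certify(
    candidate: Candidate,
    run_reference: Callable[[Candidate], bool],
    repair: Callable[[Candidate], Candidate],
    baseline_solves: Callable[[Candidate], bool],
    max_attempts: int = 3,
) -> Verdict:
    """Admit a task only if it survives both machine gates."""
    attempts = 0
    # Grounding gate: the reference must pass against the real, pinned target.
    # A failed attempt is a normal intermediate state, so repair and retry.
    while True:
        attempts += 1
        if run_reference(candidate):
            break
        if attempts >= max_attempts:
            return Verdict(False, "grounding: reference never passed", attempts)
        candidate = repair(candidate)
    # Discrimination gate: the no-search baseline must fail the task.
    if baseline_solves(candidate):
        return Verdict(False, "discrimination: parametric baseline already passes", attempts)
    return Verdict(True, "admitted", attempts)


# Stub callables: the first reference fails, one repair fixes it,
# and the parametric baseline cannot solve the task.
verdict = certify(
    Candidate("t1", "broken"),
    run_reference=lambda c: c.reference == "fixed",
    repair=lambda c: Candidate(c.task_id, "fixed"),
    baseline_solves=lambda c: False,
)
print(verdict)
```

The retry cap is an assumption of this sketch; the source only says that iteration is expected, not how many rounds are allowed.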
One corpus of cells: (task-class, harness, model, provider/arm) → {pass, lift-vs-baseline, paired discordants + exact sign-test p, tokens, latency, cost, citations, cited-authority, tools-actually-fired}. One exporter renders the cross-axis report (REPORT.md + results.csv + tasks.json + summary.json, zip/gist-shareable). Every aggregate must be reproducible from the raw cells in the same bundle.
The arms must be provably isolated: when an arm claims "provider P only," the tool-call log must show P's tools fired and the native tools did not. "Tools used" is a column, not an assumption.
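As a rough sketch of what "tools used is a column" could mean in practice, the snippet below stores the fired tools on each cell, checks arm isolation from that log, and recomputes an aggregate from the raw cells alone. The `Cell` fields are a reduced subset of the tuple above, and the tool names (`p_search`, `native_web_search`) and arm labels are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class Cell:
    task_id: str
    arm: str
    passed: bool
    tools_fired: Tuple[str, ...]  # taken from the tool-call log, never assumed


def arm_is_isolated(cell: Cell, provider_tools: Set[str], native_tools: Set[str]) -> bool:
    """True only if a provider tool fired and no native tool did."""
    fired = set(cell.tools_fired)
    return bool(fired & provider_tools) and not (fired & native_tools)


def pass_rate_by_arm(cells: Iterable[Cell]) -> Dict[str, float]:
    """Recompute the per-arm pass rate from raw cells only."""
    totals = defaultdict(lambda: [0, 0])  # arm -> [passed, total]
    for cell in cells:
        totals[cell.arm][0] += int(cell.passed)
        totals[cell.arm][1] += 1
    return {arm: passed / total for arm, (passed, total) in totals.items()}


cells = [
    Cell("t1", "provider_p", True, ("p_search",)),
    Cell("t2", "provider_p", False, ("p_search", "native_web_search")),
    Cell("t1", "native", True, ("native_web_search",)),
]

# Cells in the "provider P only" arm whose log shows a native tool firing.
leaks = [
    c.task_id
    for c in cells
    if c.arm == "provider_p"
    and not arm_is_isolated(c, {"p_search"}, {"native_web_search"})
]
rates = pass_rate_by_arm(cells)
print(leaks, rates)
```

How a leaking cell is handled (dropped, rerun, or reported) is left open here; the sketch only surfaces it.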
- Deterministic or execution-grounded oracles. Exact-identifier checks, test suites, verifiable sources. An LLM judge is a last resort and is labeled as such.
- One variable per comparison. Within a comparison, only the arm changes (same harness, same model, same tasks). Cross-harness numbers are never presented as same-model comparisons.
- Paired statistics on the same tasks. Report discordant pairs and the exact sign test, not just pooled deltas. Small-n directional results are labeled directional.
- Discrimination before scale. Before any expensive run: a cheap pilot proving the axis can move (a parametric-fail baseline, a saturation check). Never spend on a saturated or floored axis.
- Infra errors are excluded and reported, never counted as failures. No silent zeros.
- The execution record is part of the result. Tool calls, tokens, latency, citations ship with every cell so "did it really search, with what, at what cost" is auditable.
- Honest verdicts, including parity and losses. The report states what was measured vs inferred, and its threats to validity, in the artifact itself.
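The paired-statistics and infra-error rules above can be illustrated with a short sketch. It assumes each arm's outcomes arrive as a mapping from task id to `True` (pass), `False` (fail), or `None` (infra error); that encoding and the function names are assumptions of this example, not the substrate's schema. The p-value is the standard two-sided exact sign test on discordant pairs.

```python
from math import comb
from typing import Dict, Optional


def exact_sign_test(a_only: int, b_only: int) -> float:
    """Two-sided exact sign test p-value on discordant pair counts."""
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(a_only, b_only)
    # Probability of a split at least this lopsided under p = 0.5, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_summary(
    a: Dict[str, Optional[bool]], b: Dict[str, Optional[bool]]
) -> dict:
    """Compare two arms on shared tasks; infra errors are excluded and reported."""
    shared = sorted(a.keys() & b.keys())
    excluded = [t for t in shared if a[t] is None or b[t] is None]
    valid = [t for t in shared if a[t] is not None and b[t] is not None]
    a_only = sum(1 for t in valid if a[t] and not b[t])
    b_only = sum(1 for t in valid if b[t] and not a[t])
    return {
        "n_valid": len(valid),
        "excluded_infra": excluded,  # listed explicitly, never counted as failures
        "a_only": a_only,
        "b_only": b_only,
        "p_value": exact_sign_test(a_only, b_only),
    }


arm_a = {"t1": True, "t2": True, "t3": False, "t4": None, "t5": True}
arm_b = {"t1": False, "t2": True, "t3": False, "t4": True, "t5": False}
summary = paired_summary(arm_a, arm_b)
print(summary)
```

With only two discordant pairs the p-value here is 0.5, which is the kind of small-n result the rules above would label directional rather than conclusive.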
The general build rules are single-sourced in BUILDING.md — goal-first, ground-truth
before claiming, check-existing before building, cheapest-decisive-check first. Substrate-specific
additions only:
- Secrets never print. Keys live in env/dotenvx; a leaked key in a trace is a stop-the-line defect.
- Honest reporting is the moat. A negative or null verdict ships with the same care as a win — the substrate's value is that its numbers can be believed.
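One way the "secrets never print" rule could be checked before a bundle ships is a verbatim scan of trace text for known secret values. This is a hedged sketch only: the function name, the `min_len` guard, and the example key name and value are all hypothetical, and a real check would likely also cover encoded or partially redacted forms.

```python
from typing import Dict, List


def find_leaked_secrets(
    trace_text: str, secrets: Dict[str, str], min_len: int = 8
) -> List[str]:
    """Return the names of secrets whose value appears verbatim in the trace."""
    return sorted(
        name
        for name, value in secrets.items()
        # Skip very short values to avoid flagging incidental substrings.
        if len(value) >= min_len and value in trace_text
    )


# Hypothetical key name and value, for illustration only.
secrets = {"PROVIDER_API_KEY": "sk-test-1234567890"}
clean_trace = "POST /search status=200 tokens=512"
leaky_trace = "Authorization: Bearer sk-test-1234567890"

leaked = find_leaked_secrets(leaky_trace, secrets)
print(leaked)
```

The function returns names rather than values so that the check itself never prints a key.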