AI Token Efficiency Benchmark

Measures how many AI tokens (and dollars) it takes to build and iteratively extend the same application in different frameworks, using Claude Code in headless mode with multi-model delegation.

The Question

Does a simpler framework API surface lead to more efficient AI-assisted development? We tested this by having Claude build the same Conference Manager API in three frameworks, then add 5 features incrementally - measuring cost, speed, and correctness at each step.

The Conclusion

The short answer: framework choice matters more as complexity grows. At greenfield scale (6 entities), Brace and Spring Boot are roughly tied (~6% apart, within single-run noise), with Hono somewhat higher on its single run. Across the five feature additions (growing to 10 entities, 117 tests), Brace costs about 31% less in cumulative feature cost than Spring Boot, and is cheaper on every round. Documentation quality and test suite quality remain the highest-leverage factors overall. Hono performs comparably to Brace on cumulative feature cost ($5.79 vs $5.62), which suggests that minimal, explicit frameworks are more token-efficient than layered, convention-heavy ones - regardless of language. That two independent minimal frameworks in different ecosystems land within about 3% of each other strengthens the case that Spring's higher cost comes from its layering rather than from the language.

Frameworks

Framework	Language	Style	Why Included
Brace	Java 21	Minimal, explicit, no DI, no ORM magic	Type-safe, high-performance language with an explicitly AI-optimized framework
Spring Boot 3.4	Java 21	Full-featured, annotations, Spring Data JPA	Industry-standard Java framework with massive AI training data
Hono	TypeScript	Minimal, explicit, raw SQL via better-sqlite3	Minimal framework in a different type-safe language - tests whether simplicity or language familiarity drives efficiency

Methodology

Three-Phase Approach

Planning - Claude Opus reads the spec and creates a task breakdown, assigning each task to the cheapest capable model (Haiku for boilerplate, Sonnet for business logic, Opus for architecture)
Execution - Opus executes its plan, delegating tasks to subagents running in parallel where possible
Fix loops - If compilation fails or tests fail, a single model fixes issues directly in the same session (no delegation - fixes need full codebase context). The fix model is configurable (--fix-model).
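The control flow of the three phases can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function names, the callables passed in, and the max_fix_attempts cap are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    fix_attempts: int  # how many fix-loop iterations were needed
    passed: bool       # whether the test suite passed at the end


def run_benchmark(
    plan: Callable[[], list[str]],
    execute: Callable[[list[str]], None],
    run_tests: Callable[[], bool],
    fix: Callable[[], None],
    max_fix_attempts: int = 3,  # hypothetical cap, not taken from the harness
) -> RunResult:
    """Plan, execute, then loop on fixes until tests pass or the cap is hit."""
    tasks = plan()          # phase 1: task breakdown
    execute(tasks)          # phase 2: delegated execution
    passed = run_tests()
    attempts = 0
    while not passed and attempts < max_fix_attempts:
        fix()               # phase 3: single-model fix in the same session
        attempts += 1
        passed = run_tests()
    return RunResult(attempts, passed)
```

A run that passes on the first attempt returns fix_attempts=0, which is the case the feature rounds below report.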

The App: Conference Manager

A JSON REST API starting with 6 entities, CRUD endpoints, scheduling conflict detection, room capacity enforcement, and schedule query endpoints. Five feature additions progressively grow the app to 10 entities with 117 integration tests. See spec.md for the initial build spec.

The app was chosen because it has PetClinic-level complexity but isn't in AI training data, and because the scheduling logic requires real programming beyond basic CRUD.

Feature Addition Rounds

Each round adds a feature to the existing codebase. The AI must read and understand the current implementation, add new functionality, and ensure all existing tests continue to pass.

Round	Feature	New Entity	Key Challenge	Cumulative Tests
F1	Speaker Availability	Availability	Modify existing talk validation	50
F2	Waitlist + Auto-Promotion	WaitlistEntry	Cross-entity mutation on registration delete	63
F3	Ratings & Speaker Stats	Rating	Aggregate queries, modify response shapes	80
F4	Multi-Day Events & Tracks	Track	Modify existing entity + validation + schedule format	101
F5	Notifications & Activity Feed	Notification	Hook into 4 existing flows (registration, waitlist, talks, ratings)	117
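The cumulative test counts in the table follow from the 35 base tests plus the per-feature test files listed under Structure. A short check of that arithmetic, using only numbers from this README:

```python
BASE_TESTS = 35  # tests/test_conference.py

# Tests added by each feature round, per the Structure section.
ADDED_TESTS = {"F1": 15, "F2": 13, "F3": 17, "F4": 21, "F5": 16}

cumulative = {}
total = BASE_TESTS
for round_name, added in ADDED_TESTS.items():
    total += added
    cumulative[round_name] = total

for round_name, count in cumulative.items():
    print(f"{round_name}: {count} tests")
```

The final value is 117, matching the F5 row.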

Fairness

All frameworks receive the identical spec prompt
All have a CLAUDE.md with equivalent documentation depth (~200-230 lines each)
All use in-memory databases (H2 for Java, SQLite for TypeScript)
All start from a minimal template (build config + empty entry point)
Same model (Opus) does planning and orchestration for all
Same test suite (Python HTTP tests) runs against all
Feature rounds build on AI-generated code (not hand-written), so both the starting codebase and the additions are AI-produced under identical conditions

Provenance of the Published Numbers

The results below are a snapshot in time, not a guarantee of bit-for-bit reproduction. They were produced with:

Brace 0.1.0-SNAPSHOT - an early pre-release. The current template pins v0.1.6; the Brace API has since changed (notably the request-param methods and the Maven/package namespace, io.brace → com.larvalabs).
Claude models claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5 (the model IDs recorded in results/*.json).
API pricing as of the run dates (April 2026).

Re-running today against the pinned Brace v0.1.6 and current Claude models will produce different absolute dollar figures - model versions, pricing, and run-to-run variance all move. The value of this benchmark is the relative comparison between frameworks under identical conditions and the methodology, not the exact dollar amounts.

Results: Greenfield Build

	Brace	Spring Boot	Hono
Tests passed (first attempt)	33/35	33/35	33/35
Tests passed (final)	35/35	35/35	35/35
Fix loop attempts	1	1	1
Total cost	$2.24	$2.38	$2.75
Wall clock	10.1 min	8.4 min	9.0 min
Lines of code	562	914	518

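All three frameworks produced identical test outcomes: 33/35 on the first attempt and 35/35 after one fix loop. The 6% cost difference between Brace and Spring is within noise for a single run; Hono's greenfield run cost about 23% more than Brace's.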

Results: Feature Additions (F1–F5)

All runs used Opus for fix loops (--fix-model opus).

Cost Per Feature

Round	Feature	Brace	Spring	Hono
F1	Speaker Availability	$1.01	$1.59	$1.48
F2	Waitlist + Auto-Promotion	$1.02	$1.14	$0.85
F3	Ratings & Speaker Stats	$0.75	$1.18	$0.91
F4	Multi-Day Events & Tracks	$1.29	$1.96	$1.37
F5	Notifications & Activity Feed	$1.54	$2.29	$1.18
	Cumulative (F1-F5)	$5.62	$8.16	$5.79

First-Attempt Accuracy

All F1-F5 runs achieved 100% test pass on first attempt (0 fix loops) across all three frameworks, after clarifying an ambiguous edge case in the F1 spec. F2-F5 were clean from the start. See Caveats for notes on run-to-run variance.

Brace vs Spring Savings by Round

Round	Saving
F1	36%
F2	11%
F3	36%
F4	34%
F5	33%

Brace is cheaper on every round. The saving is fairly stable around a third (F2 is a low outlier), rather than widening monotonically - single-run variance per round is real.
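The savings figures can be reproduced from the Cost Per Feature table. The sketch below recomputes them; the only inputs are the published dollar amounts, and the variable names are ours.

```python
# Per-round cost in USD, copied from the Cost Per Feature table.
COST = {
    "F1": {"brace": 1.01, "spring": 1.59, "hono": 1.48},
    "F2": {"brace": 1.02, "spring": 1.14, "hono": 0.85},
    "F3": {"brace": 0.75, "spring": 1.18, "hono": 0.91},
    "F4": {"brace": 1.29, "spring": 1.96, "hono": 1.37},
    "F5": {"brace": 1.54, "spring": 2.29, "hono": 1.18},
}
# Cumulative totals as published in the table.
PUBLISHED_TOTAL = {"brace": 5.62, "spring": 8.16, "hono": 5.79}


def saving_pct(cost: float, baseline: float) -> int:
    """Percentage saved relative to the baseline, rounded to a whole percent."""
    return round(100 * (1 - cost / baseline))


per_round = {r: saving_pct(c["brace"], c["spring"]) for r, c in COST.items()}
cumulative_saving = saving_pct(PUBLISHED_TOTAL["brace"], PUBLISHED_TOTAL["spring"])
hono_saving = saving_pct(PUBLISHED_TOTAL["hono"], PUBLISHED_TOTAL["spring"])

# Summing the rounded per-round figures gives 5.61 for Brace, one cent below
# the published 5.62; presumably the published total sums unrounded costs.
brace_total = round(sum(c["brace"] for c in COST.values()), 2)

print(per_round, cumulative_saving, hono_saving, brace_total)
```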

Codebase Size at Each Stage

Stage	Brace (LOC / files)	Spring (LOC / files)	Hono (LOC / files)
After F1	862 / 16	1,180 / 22	732 / 9
After F2	942 / 18	1,362 / 25	827 / 10
After F3	1,060 / 20	1,587 / 28	926 / 11

Spring's codebase is consistently 1.4–1.5x larger than Brace's due to the repository layer, additional annotations, and service classes. This directly translates to more tokens the AI must read when adding features.

Model Delegation Patterns

Opus autonomously decides how to delegate tasks to subagents. The model choices reveal what each framework demands:

Output tokens by model per round:

Round	Model	Brace	Spring	Hono
Greenfield	Opus	22,334	22,210	28,034
	Sonnet	5,744	9,052	7,224
	Haiku	8,860	11,040	0
F1	Opus	9,741	14,424	16,202
	Sonnet	1,244	1,145	1,206
	Haiku	1,494	5,542	5,013
F2	Opus	9,126	9,703	6,169
	Sonnet	1,821	5,167	2,150
	Haiku	768	3,247	3,447
F3	Opus	5,161	7,785	7,217
	Sonnet	2,314	8,121	1,841
	Haiku	1,659	1,442	4,487
F4	Opus	10,243	16,102	10,650
	Sonnet	6,668	3,937	7,491
	Haiku	291	4,664	2,229
F5	Opus	12,535	18,702	9,527
	Sonnet	4,565	4,325	3,463
	Haiku	1,754	6,672	4,154

Key patterns:

Opus (the cost driver): Spring requires more Opus tokens than Brace in every feature round (the greenfield builds are level at roughly 22k each) - by F5 it's using 49% more (18,702 vs 12,535). The orchestrator has to read and reason about a larger, more layered codebase on every turn.
Haiku (cheapest model): In the greenfield build, Brace delegated heavily to Haiku (8,860 tokens) - simple lambda handlers and entity classes were straightforward enough for the cheapest model. Hono used no Haiku at all in greenfield - the orchestrator kept all the work on Opus and Sonnet. In feature additions, Haiku usage falls below its greenfield level for Brace and Spring as work shifts from generating boilerplate to modifying existing logic, while Hono starts delegating to Haiku (2,229-5,013 tokens per round).
The real takeaway: The absolute savings from cheaper model delegation are small - Opus orchestration dominates cost regardless. But the relative advantage of a smaller codebase (fewer tokens for the orchestrator to read) persists regardless of pricing. The 1.5x codebase size difference between Spring and Brace is a structural property, not a pricing artefact.
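To make the Opus comparison concrete, the ratio of Spring's to Brace's Opus output tokens can be computed per round from the table above. This is plain arithmetic on the published token counts; nothing here comes from the harness itself.

```python
# Opus output tokens per round as (brace, spring), from the delegation table.
OPUS_TOKENS = {
    "Greenfield": (22_334, 22_210),
    "F1": (9_741, 14_424),
    "F2": (9_126, 9_703),
    "F3": (5_161, 7_785),
    "F4": (10_243, 16_102),
    "F5": (12_535, 18_702),
}

spring_to_brace = {
    round_name: round(spring / brace, 2)
    for round_name, (brace, spring) in OPUS_TOKENS.items()
}

for round_name, ratio in spring_to_brace.items():
    print(f"{round_name}: Spring uses {ratio}x Brace's Opus output tokens")
```

The ratio sits at about 1.0 for greenfield and between roughly 1.06 and 1.57 across the feature rounds.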

Runtime Performance

Token efficiency is one axis. Runtime performance is another.

Brace and Spring are both Java - their runtime performance is comparable. Both use Hibernate for DB access in typical applications, and both run on a modern JVM with virtual threads. Brace's HTTP layer (Jetty 12) is 10-20% faster than Spring's (Tomcat/Undertow) on pure HTTP workloads. For DB-heavy workloads they're similar. (Detailed Brace vs Spring benchmarks)

Hono runs on JavaScript/TypeScript runtimes (Node.js, Deno, or Bun), which are fundamentally slower than Java for sustained server workloads. On TechEmpower Round 23 (identical hardware), Java frameworks handle roughly 3x the throughput of JS frameworks on realistic DB workloads - Spring at 244k req/s vs Express at 78k req/s on the Fortunes test. This is a consistent pattern across every TFB round and test type.

The tradeoff:

	Token Cost (F1-F5)	Runtime Throughput	Ecosystem
Brace	$5.62 (31% less than Spring)	High (Java)	Java
Spring	$8.16	High (Java)	Java
Hono	$5.79 (29% less than Spring)	Moderate (~3x slower than Java)	TypeScript

For projects that need both token efficiency and runtime performance, Brace is the clear choice. For projects where throughput isn't the bottleneck, Hono offers comparable token savings in the TypeScript ecosystem.

Key Findings

Framework choice matters more as complexity grows

At greenfield scale, Brace and Spring differ by only ~6% in cost - swamped by run-to-run noise. But once features accumulate and the AI must read and modify a growing codebase, a consistent gap opens up: Brace is cheaper on every round (11–36% per round), for a cumulative saving of 31% ($5.62 vs $8.16).

This makes sense mechanically: Spring's layered architecture (Controller → Service → Repository → Entity → DTO) means the AI reads and modifies more files per feature. Brace's flat structure (Controller → Entity) keeps the context smaller. The cost grows with the number of files the AI must load into context on each turn.

Hono is cost-competitive, but trades runtime performance

Hono's cumulative feature cost ($5.79) is comparable to Brace ($5.62). It benefits from both API simplicity AND extensive AI training data. However, Hono/Node.js can't match Java's runtime throughput under high concurrency (see Runtime Performance above). For projects where runtime performance matters, Brace matches Hono's token efficiency while keeping Java's performance characteristics.

Documentation quality remains the highest-leverage investment

Our first Brace benchmark run cost $4.45 due to bugs in the CLAUDE.md - wrong parameter syntax, missing dependency, no column naming guidance. After fixing these, the cost dropped to $1.79. Bad docs cost 2.5x more than good docs. This dwarfs any framework design advantage, especially at small scale.

Test suite quality is the second highest-leverage investment

Hardcoded auto-increment IDs in the test suite caused fix loops to waste tokens working around ID mismatches rather than fixing real bugs. Fixing the tests to use returned IDs eliminated these wasted loops across all frameworks.
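The difference between the two test styles can be shown with a small stand-in. The FakeApi class and its room fields below are hypothetical and exist only so the example runs without a server; the real suite makes HTTP calls against the generated app.

```python
import itertools


class FakeApi:
    """In-memory stand-in for the REST API; IDs keep counting across calls."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.rooms = {}

    def create_room(self, name, capacity):
        room_id = next(self._ids)
        self.rooms[room_id] = {"id": room_id, "name": name, "capacity": capacity}
        return self.rooms[room_id]

    def get_room(self, room_id):
        return self.rooms.get(room_id)


def brittle_lookup(api):
    # Assumes the new row received ID 1, which only holds on a fresh database.
    api.create_room("Main Hall", 200)
    return api.get_room(1)


def robust_lookup(api):
    # Uses the ID the API returned, so earlier inserts cannot break the test.
    created = api.create_room("Main Hall", 200)
    return api.get_room(created["id"])
```

If an earlier test has already created a room, brittle_lookup fetches that older room while robust_lookup still finds "Main Hall". A fix model facing the brittle version sees a failure that has nothing to do with the application code.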

Run-to-run variance is real but diminishing

Single-run comparisons are noisy - the same ambiguous spec interpretation can cascade into 10-15 test failures. However, the variance matters less as the dataset grows. With 5 feature rounds, the per-round noise averages out and the trend is clear: Brace consistently costs less than Spring, by a margin that stays near a third in four of the five rounds.

Cache reads dominate cost

The biggest cost driver isn't output tokens - it's the orchestrator re-reading the growing conversation on each turn. This is why codebase size matters: a larger codebase means more tokens loaded into context on every read, which compounds across the planning, execution, and fix phases.
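A toy model illustrates why context re-reads compound. It assumes the orchestrator re-reads a fixed codebase plus a conversation that grows by a constant amount each turn; the function name and all input numbers are illustrative and are not measurements from the benchmark.

```python
def total_context_tokens(codebase_tokens: int, turns: int, growth_per_turn: int) -> int:
    """Tokens read across all turns when each turn re-reads everything so far."""
    return sum(codebase_tokens + growth_per_turn * turn for turn in range(turns))


# Illustrative inputs only: a codebase 1.5x larger, same number of turns.
small = total_context_tokens(codebase_tokens=10_000, turns=20, growth_per_turn=500)
large = total_context_tokens(codebase_tokens=15_000, turns=20, growth_per_turn=500)

print(small, large, round(large / small, 2))
```

In this model the codebase term is paid on every turn, so a larger codebase raises the total in proportion to the number of turns, independent of what each token costs.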

Fix Loop Model Selection

All the feature addition results above use Opus for fix loops (--fix-model opus). Earlier greenfield runs tested Sonnet as the fix model.

Greenfield: Sonnet vs Opus for fixes

Caveat: This data is from early runs that had since-fixed issues (hardcoded test IDs, weak process cleanup). Sonnet's 3-attempt struggle on Hono may have been caused by a test infrastructure bug rather than a language-specific weakness. We haven't re-run this comparison with the fixed harness.

The same bug (overlap boundary >= vs >) hit all three frameworks. The fix model changed the outcome:

	Sonnet fix cost	Sonnet attempts	Opus fix cost	Opus attempts
Brace	$0.46	1	$1.04	1
Spring	$0.56	1	$1.10	1
Hono	$1.49	3	$0.87	1

Sonnet fixed the Java frameworks in one attempt at roughly half the cost of Opus. On this particular run, Sonnet struggled with Hono - it regressed by removing a conflict check, then hallucinated a "stale data" diagnosis, needing 3 attempts. Opus fixed the same bug in 1 attempt. Note that Sonnet handles Hono feature development fine (see the model delegation table above, where Hono regularly delegates to Sonnet) - debugging existing code may be a harder task than generating new code, though this is a single data point and may not generalize.
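The overlap-boundary bug is easy to reproduce in isolation. The sketch below contrasts a half-open interval check with an inclusive one; which of the two the spec requires is not stated in this README, so the choice of "correct" version here is an assumption, as are the function names.

```python
from datetime import datetime


def overlaps(start_a, end_a, start_b, end_b):
    """Half-open intervals [start, end): back-to-back talks do not conflict."""
    return start_a < end_b and start_b < end_a


def overlaps_inclusive(start_a, end_a, start_b, end_b):
    """Inclusive endpoints: a talk ending at 10:00 conflicts with one starting at 10:00."""
    return start_a <= end_b and start_b <= end_a


nine = datetime(2026, 4, 1, 9, 0)
ten = datetime(2026, 4, 1, 10, 0)
eleven = datetime(2026, 4, 1, 11, 0)

# Two talks in the same room, one directly after the other.
print(overlaps(nine, ten, ten, eleven))            # False
print(overlaps_inclusive(nine, ten, ten, eleven))  # True
```

A single comparison operator decides whether back-to-back talks are accepted, which is why one boundary mistake can fail the same tests in every framework.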

Why we used Opus for feature additions

We chose Opus as the fix model for F1–F5 to be conservative. In practice, the fix model barely mattered - most rounds needed zero fix loops across all frameworks. With good specs and docs, first-attempt success is the norm and the fix model is rarely invoked.

Running the Benchmark

Prerequisites

Java 21+
Maven
Node.js 18+ and npm (for Hono)
Python 3.10+ with pip install -r tests/requirements.txt
Claude Code CLI

The Brace track depends on com.larvalabs:brace:0.1.6, which is not published to Maven Central. Before running the Brace benchmark, clone larvalabs/brace, check out the matching tag, and install it to your local Maven repository:

git clone https://github.com/larvalabs/brace.git
cd brace && git checkout v0.1.6 && mvn install -DskipTests

The template pins v0.1.6 (the latest release at the time this harness was last updated). To run against a newer Brace, bump the version in brace-template/pom.xml and re-check the brace-template/CLAUDE.md against the current API - the request-param methods, in particular, have changed across releases.

Commands

# Full greenfield benchmark (3 runs, all frameworks, parallel)
./run.sh

# Single greenfield run, single framework
./run.sh 1 --brace

# Feature 1 (Speaker Availability) - requires completed greenfield
./run-feature.sh 1 --fix-model opus

# Feature chain (F2-F5) - each builds on previous
./run-chain.sh 2 3 4 5 --fix-model opus

# Single feature, single framework
./run-chain.sh 4 --spring --fix-model opus

# Custom fix model
./run-chain.sh 2 3 --brace --fix-model sonnet

Output

results/<framework>-run<N>.json - greenfield metrics
results/<framework>-feature-run<N>.json - F1 metrics
results/<framework>-feature<N>-run<M>.json - F2-F5 metrics
work/<framework>-<phase>-run<N>/ - generated code, plans, execution logs
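When post-processing results, the three file-name patterns above can be told apart with a single regular expression. This is a sketch based only on the naming scheme listed here; it assumes framework names are lowercase words such as brace, spring, and hono, and it does not read the JSON contents, whose schema is not described in this README.

```python
import re
from typing import NamedTuple, Optional


class ResultFile(NamedTuple):
    framework: str
    phase: str  # "greenfield", or "F1" through "F5"
    run: int


# <framework>-run<N>.json | <framework>-feature-run<N>.json | <framework>-feature<N>-run<M>.json
_PATTERN = re.compile(
    r"^(?P<framework>[a-z]+)(?:-feature(?P<feature>\d*))?-run(?P<run>\d+)\.json$"
)


def classify(filename: str) -> Optional[ResultFile]:
    """Map a results file name to (framework, phase, run), or None if it doesn't match."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    feature = match.group("feature")
    if feature is None:
        phase = "greenfield"   # no "-feature" segment at all
    elif feature == "":
        phase = "F1"           # "-feature" with no number is the F1 run
    else:
        phase = f"F{feature}"  # "-feature<N>" for F2-F5
    return ResultFile(match.group("framework"), phase, int(match.group("run")))
```

For example, classify("hono-feature4-run1.json") yields the framework hono, phase F4, run 1.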

Structure

spec.md                         # Greenfield prompt (identical for all)
feature-spec.md                 # F1: Speaker Availability
feature-spec-2-waitlist.md      # F2: Waitlist + Auto-Promotion
feature-spec-3-ratings.md       # F3: Ratings & Speaker Stats
feature-spec-4-multiday.md      # F4: Multi-Day Events & Tracks
feature-spec-5-notifications.md # F5: Notifications & Activity Feed
tests/
  test_conference.py            # 35 base tests
  test_feature_availability.py  # +15 (F1)
  test_feature_waitlist.py      # +13 (F2)
  test_feature_ratings.py       # +17 (F3)
  test_feature_multiday.py      # +21 (F4)
  test_feature_notifications.py # +16 (F5)
  requirements.txt
brace-template/                 # Starting point for Brace
spring-template/                # Starting point for Spring Boot
hono-template/                  # Starting point for Hono
run.sh                          # Greenfield benchmark
run-feature.sh                  # F1 benchmark
run-chain.sh                    # F2-F5 chained benchmark
results/                        # JSON output from each run

Caveats

Most comparisons are single-run per feature round - variance is real, though 5 data points per framework provide a trend
A novel framework (Brace) vs well-known ones (Spring, Hono) - AI has extensive Spring/Hono training data but only CLAUDE.md for Brace. This disadvantages Brace, making the cost savings more notable
Cost depends on API pricing which changes over time
The Conference Manager is a mid-complexity CRUD app - results may differ for very different app types (heavy computation, real-time systems, etc.)
All three frameworks reached 100% test pass rate eventually - the difference is in cost to get there, not capability
