Measures how many AI tokens (and dollars) it takes to build and iteratively extend the same application in different frameworks, using Claude Code in headless mode with multi-model delegation.
- The Question
- The Conclusion
- Frameworks
- Methodology
- Results: Greenfield Build
- Results: Feature Additions (F1–F5)
- Key Findings
- Fix Loop Model Selection
- Running the Benchmark
- Caveats
Does a simpler framework API surface lead to more efficient AI-assisted development? We tested this by having Claude build the same Conference Manager API in three frameworks, then add 5 features incrementally - measuring cost, speed, and correctness at each step.
The short answer: framework choice matters more as complexity grows. At greenfield scale (6 entities), Brace and Spring Boot are roughly tied (~6% apart, within single-run noise), with Hono somewhat higher. Across the five feature additions (growing to 10 entities, 117 tests), Brace costs about 31% less in cumulative feature cost than Spring Boot, and is cheaper on every round. Documentation quality and test suite quality remain the highest-leverage factors overall. Hono performs comparably to Brace on cumulative feature cost ($5.79 vs $5.62), which suggests that minimal, explicit frameworks are more token-efficient than layered, convention-heavy ones regardless of language. The consistency across two independent minimal frameworks in different ecosystems strengthens the case that Spring's higher cost comes from its layering rather than from run-to-run noise.
| Framework | Language | Style | Why Included |
|---|---|---|---|
| Brace | Java 21 | Minimal, explicit, no DI, no ORM magic | Type-safe, high-performance language with an explicitly AI-optimized framework |
| Spring Boot 3.4 | Java 21 | Full-featured, annotations, Spring Data JPA | Industry-standard Java framework with massive AI training data |
| Hono | TypeScript | Minimal, explicit, raw SQL via better-sqlite3 | Minimal framework in a different type-safe language - tests whether simplicity or language familiarity drives efficiency |
- Planning - Claude Opus reads the spec and creates a task breakdown, assigning each task to the cheapest capable model (Haiku for boilerplate, Sonnet for business logic, Opus for architecture)
- Execution - Opus executes its plan, delegating tasks to subagents running in parallel where possible
- Fix loops - If compilation fails or tests fail, a single model fixes issues directly in the same session (no delegation - fixes need full codebase context). The fix model is configurable (`--fix-model`).
A JSON REST API starting with 6 entities, CRUD endpoints, scheduling conflict detection, room capacity enforcement, and schedule query endpoints. Five feature additions progressively grow the app to 10 entities with 117 integration tests. See spec.md for the initial build spec.
Chosen because it has PetClinic-level complexity but isn't in AI training data, and the scheduling logic requires real programming beyond basic CRUD.
Each round adds a feature to the existing codebase. The AI must read and understand the current implementation, add new functionality, and ensure all existing tests continue to pass.
| Round | Feature | New Entity | Key Challenge | Cumulative Tests |
|---|---|---|---|---|
| F1 | Speaker Availability | Availability | Modify existing talk validation | 50 |
| F2 | Waitlist + Auto-Promotion | WaitlistEntry | Cross-entity mutation on registration delete | 63 |
| F3 | Ratings & Speaker Stats | Rating | Aggregate queries, modify response shapes | 80 |
| F4 | Multi-Day Events & Tracks | Track | Modify existing entity + validation + schedule format | 101 |
| F5 | Notifications & Activity Feed | Notification | Hook into 4 existing flows (registration, waitlist, talks, ratings) | 117 |
- All frameworks receive the identical spec prompt
- All have a CLAUDE.md with equivalent documentation depth (~200-230 lines each)
- All use in-memory databases (H2 for Java, SQLite for TypeScript)
- All start from a minimal template (build config + empty entry point)
- Same model (Opus) does planning and orchestration for all
- Same test suite (Python HTTP tests) runs against all
- Feature rounds build on AI-generated code (not hand-written), so both the starting codebase and the additions are AI-produced under identical conditions
The results below are a snapshot in time, not a guarantee of bit-for-bit reproduction. They were produced with:
- Brace `0.1.0-SNAPSHOT` - an early pre-release. The current template pins `v0.1.6`; the Brace API has since changed (notably the request-param methods and the Maven/package namespace, `io.brace` → `com.larvalabs`).
- Claude models `claude-opus-4-6`, `claude-sonnet-4-6`, `claude-haiku-4-5` (the model IDs recorded in `results/*.json`).
- API pricing as of the run dates (April 2026).
Re-running today against the pinned Brace v0.1.6 and current Claude models will produce different absolute dollar figures - model versions, pricing, and run-to-run variance all move. The value of this benchmark is the relative comparison between frameworks under identical conditions and the methodology, not the exact dollar amounts.
| | Brace | Spring Boot | Hono |
|---|---|---|---|
| Tests passed (first attempt) | 33/35 | 33/35 | 33/35 |
| Tests passed (final) | 35/35 | 35/35 | 35/35 |
| Fix loop attempts | 1 | 1 | 1 |
| Total cost | $2.24 | $2.38 | $2.75 |
| Wall clock | 10.1 min | 8.4 min | 9.0 min |
| Lines of code | 562 | 914 | 518 |
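All three frameworks produced identical correctness results (33/35 on the first attempt, 35/35 after one fix loop). The 6% cost difference between Brace and Spring is within noise for a single run; Hono's greenfield run was the most expensive at $2.75, about 23% above Brace.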
All runs used Opus for fix loops (--fix-model opus).
| Round | Feature | Brace | Spring | Hono |
|---|---|---|---|---|
| F1 | Speaker Availability | $1.01 | $1.59 | $1.48 |
| F2 | Waitlist + Auto-Promotion | $1.02 | $1.14 | $0.85 |
| F3 | Ratings & Speaker Stats | $0.75 | $1.18 | $0.91 |
| F4 | Multi-Day Events & Tracks | $1.29 | $1.96 | $1.37 |
| F5 | Notifications & Activity Feed | $1.54 | $2.29 | $1.18 |
| Cumulative (F1-F5) | | $5.62 | $8.16 | $5.79 |
All F1-F5 runs achieved 100% test pass on first attempt (0 fix loops) across all three frameworks, after clarifying an ambiguous edge case in the F1 spec. F2-F5 were clean from the start. See Caveats for notes on run-to-run variance.
| Round | Brace saving vs Spring |
|---|---|
| F1 | 36% |
| F2 | 11% |
| F3 | 36% |
| F4 | 34% |
| F5 | 33% |
Brace is cheaper on every round. The saving is fairly stable around a third (F2 is a low outlier), rather than widening monotonically - single-run variance per round is real.
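The saving percentages can be reproduced from the per-round costs in the results table. The following is a minimal sketch of that arithmetic; the helper name `saving_pct` is ours and is not part of the benchmark harness.

```python
# Per-round costs in USD, copied from the feature-addition results table.
BRACE = {"F1": 1.01, "F2": 1.02, "F3": 0.75, "F4": 1.29, "F5": 1.54}
SPRING = {"F1": 1.59, "F2": 1.14, "F3": 1.18, "F4": 1.96, "F5": 2.29}


def saving_pct(cheaper: float, baseline: float) -> int:
    """Return how much less `cheaper` costs than `baseline`, as a rounded percent."""
    return round((1 - cheaper / baseline) * 100)


# Saving of Brace relative to Spring for each feature round.
per_round = {name: saving_pct(BRACE[name], SPRING[name]) for name in BRACE}

# The cumulative saving uses the published totals ($5.62 vs $8.16).
cumulative = saving_pct(5.62, 8.16)

for name, pct in per_round.items():
    print(f"{name}: {pct}%")
print(f"F1-F5: {cumulative}%")
```

Because the published per-round figures are already rounded to cents, recomputing from unrounded data in `results/*.json` may differ by a point.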
| Stage | Brace (LOC / files) | Spring (LOC / files) | Hono (LOC / files) |
|---|---|---|---|
| After F1 | 862 / 16 | 1,180 / 22 | 732 / 9 |
| After F2 | 942 / 18 | 1,362 / 25 | 827 / 10 |
| After F3 | 1,060 / 20 | 1,587 / 28 | 926 / 11 |
Spring's codebase is consistently 1.4–1.5x larger than Brace's due to the repository layer, additional annotations, and service classes. This directly translates to more tokens the AI must read when adding features.
Opus autonomously decides how to delegate tasks to subagents. The model choices reveal what each framework demands:
Output tokens by model per round:
| Round | Model | Brace | Spring | Hono |
|---|---|---|---|---|
| Greenfield | Opus | 22,334 | 22,210 | 28,034 |
| | Sonnet | 5,744 | 9,052 | 7,224 |
| | Haiku | 8,860 | 11,040 | 0 |
| F1 | Opus | 9,741 | 14,424 | 16,202 |
| | Sonnet | 1,244 | 1,145 | 1,206 |
| | Haiku | 1,494 | 5,542 | 5,013 |
| F2 | Opus | 9,126 | 9,703 | 6,169 |
| | Sonnet | 1,821 | 5,167 | 2,150 |
| | Haiku | 768 | 3,247 | 3,447 |
| F3 | Opus | 5,161 | 7,785 | 7,217 |
| | Sonnet | 2,314 | 8,121 | 1,841 |
| | Haiku | 1,659 | 1,442 | 4,487 |
| F4 | Opus | 10,243 | 16,102 | 10,650 |
| | Sonnet | 6,668 | 3,937 | 7,491 |
| | Haiku | 291 | 4,664 | 2,229 |
| F5 | Opus | 12,535 | 18,702 | 9,527 |
| | Sonnet | 4,565 | 4,325 | 3,463 |
| | Haiku | 1,754 | 6,672 | 4,154 |
Key patterns:
- Opus (the cost driver): In every feature round Spring requires more Opus tokens than Brace; by F5 it's using 49% more (18,702 vs 12,535). Greenfield Opus usage is essentially equal (22,210 vs 22,334). The orchestrator has to read and reason about a larger, more layered codebase on every turn.
- Haiku (cheapest model): In the greenfield build, Brace delegated heavily to Haiku (8,860 tokens) - simple lambda handlers and entity classes were straightforward enough for the cheapest model. Hono used no Haiku at all - Opus kept everything on Opus and Sonnet. In feature additions, Haiku usage drops for Brace and Spring as work shifts from generating boilerplate to modifying existing logic, while Hono starts delegating roughly 2,000-5,000 Haiku tokens per round.
- The real takeaway: The absolute savings from cheaper model delegation are small - Opus orchestration dominates cost regardless. But the relative advantage of a smaller codebase (fewer tokens for the orchestrator to read) persists regardless of pricing. The 1.5x codebase size difference between Spring and Brace is a structural property, not a pricing artefact.
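To make the Opus comparison concrete, the sketch below recomputes the F5 figures from the delegation table. It looks only at output tokens, so it says nothing about input tokens or pricing; the function name `opus_share` is ours.

```python
# F5 output tokens by model, copied from the delegation table above.
F5_OUTPUT_TOKENS = {
    "Brace": {"opus": 12_535, "sonnet": 4_565, "haiku": 1_754},
    "Spring": {"opus": 18_702, "sonnet": 4_325, "haiku": 6_672},
    "Hono": {"opus": 9_527, "sonnet": 3_463, "haiku": 4_154},
}


def opus_share(tokens: dict[str, int]) -> float:
    """Fraction of a framework's output tokens that were produced by Opus."""
    return tokens["opus"] / sum(tokens.values())


# How many more Opus output tokens Spring needed than Brace in F5, in percent.
extra_opus_pct = round(
    (F5_OUTPUT_TOKENS["Spring"]["opus"] / F5_OUTPUT_TOKENS["Brace"]["opus"] - 1) * 100
)

for framework, tokens in F5_OUTPUT_TOKENS.items():
    print(f"{framework}: Opus share of output tokens = {opus_share(tokens):.1%}")
print(f"Spring vs Brace Opus tokens in F5: +{extra_opus_pct}%")
```

In all three frameworks Opus produces more than half of the F5 output tokens, which is consistent with orchestration dominating cost.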
Token efficiency is one axis. Runtime performance is another.
Brace and Spring are both Java, so their runtime performance is comparable. Both use Hibernate for DB access in typical applications, and both run on a modern JVM with virtual threads. Brace's HTTP layer (Jetty 12) is 10-20% faster than Spring's (Tomcat/Undertow) on pure HTTP workloads. For DB-heavy workloads they're similar. (Detailed Brace vs Spring benchmarks)
Hono runs on JavaScript/TypeScript runtimes (Node.js, Deno, or Bun), which are fundamentally slower than Java for sustained server workloads. On TechEmpower Round 23 (identical hardware), Java frameworks handle roughly 3x the throughput of JS frameworks on realistic DB workloads - Spring at 244k req/s vs Express at 78k req/s on the Fortunes test. This is a consistent pattern across every TFB round and test type.
The tradeoff:
| | Token Cost (F1-F5) | Runtime Throughput | Ecosystem |
|---|---|---|---|
| Brace | $5.62 (31% less than Spring) | High (Java) | Java |
| Spring | $8.16 | High (Java) | Java |
| Hono | $5.79 (29% less than Spring) | Moderate (~3x slower than Java) | TypeScript |
For projects that need both token efficiency and runtime performance, Brace is the clear choice. For projects where throughput isn't the bottleneck, Hono offers comparable token savings in the TypeScript ecosystem.
At greenfield scale, the cost gap between Brace and Spring is ~6%, which is swamped by run-to-run noise. But once features accumulate and the AI must read and modify a growing codebase, a consistent gap opens up: Brace is cheaper on every round (11–36% per round), for a cumulative saving of 31% ($5.62 vs $8.16).
This makes sense mechanically: Spring's layered architecture (Controller → Service → Repository → Entity → DTO) means the AI reads and modifies more files per feature. Brace's flat structure (Controller → Entity) keeps the context smaller. The cost grows with the number of files the AI must load into context on each turn.
Hono's cumulative feature cost ($5.79) is comparable to Brace ($5.62). It benefits from both API simplicity AND extensive AI training data. However, Hono/Node.js can't match Java's runtime throughput under high concurrency (see Runtime Performance above). For projects where runtime performance matters, Brace offers most of Hono's token efficiency with Java's performance characteristics.
Our first Brace benchmark run cost $4.45 due to bugs in the CLAUDE.md - wrong parameter syntax, missing dependency, no column naming guidance. After fixing these, the cost dropped to $1.79. Bad docs cost 2.5x more than good docs. This dwarfs any framework design advantage, especially at small scale.
Hardcoded auto-increment IDs in the test suite caused fix loops to waste tokens working around ID mismatches rather than fixing real bugs. Fixing the tests to use returned IDs eliminated these wasted loops across all frameworks.
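The difference between the two test styles can be illustrated with a small sketch. `FakeConferenceApi` is a hypothetical in-memory stand-in used only so the example runs without a server; it is not the benchmark's real test client, and the real tests talk to the application over HTTP.

```python
import itertools


class FakeConferenceApi:
    """Hypothetical in-memory stand-in for the HTTP API under test."""

    def __init__(self, first_id: int = 1):
        self._ids = itertools.count(first_id)
        self._speakers = {}

    def create_speaker(self, name: str) -> dict:
        speaker = {"id": next(self._ids), "name": name}
        self._speakers[speaker["id"]] = speaker
        return speaker

    def get_speaker(self, speaker_id: int):
        return self._speakers.get(speaker_id)


def fragile_check(api: FakeConferenceApi) -> bool:
    """Assumes the first created speaker always gets id 1."""
    api.create_speaker("Ada")
    return api.get_speaker(1) is not None


def robust_check(api: FakeConferenceApi) -> bool:
    """Uses the id returned by the create call, whatever it is."""
    created = api.create_speaker("Ada")
    return api.get_speaker(created["id"]) is not None


# A backend whose id sequence starts at 1 passes both checks.
both_pass = (fragile_check(FakeConferenceApi(1)), robust_check(FakeConferenceApi(1)))
# A correct backend with a different starting id fails only the fragile check.
only_robust = (fragile_check(FakeConferenceApi(100)), robust_check(FakeConferenceApi(100)))
print(both_pass, only_robust)
```

The fragile version fails on an implementation that is functionally correct, which is the kind of failure a fix loop then spends tokens trying to work around.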
Single-run comparisons are noisy - the same ambiguous spec interpretation can cascade into 10-15 test failures. However, the variance matters less as the dataset grows. With 5 feature rounds, the per-round noise averages out and the trend is clear: Brace consistently costs less than Spring, by roughly a third in four of the five rounds.
The biggest cost driver isn't output tokens - it's the orchestrator re-reading the growing conversation on each turn. This is why codebase size matters: a larger codebase means more tokens loaded into context on every read, which compounds across the planning, execution, and fix phases.
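A toy model shows why re-reading compounds. It assumes every turn re-sends the whole conversation so far and ignores prompt caching; the token counts are illustrative placeholders, not measurements from the benchmark.

```python
def total_input_tokens(turn_additions: list[int], base_prompt: int = 0) -> int:
    """Sum the context that is re-sent on every turn of one session.

    `turn_additions[i]` is how many tokens turn i adds to the conversation
    (file reads, tool results, model output). All values are hypothetical.
    """
    context = base_prompt
    total = 0
    for added in turn_additions:
        total += context  # the model re-reads everything accumulated so far
        context += added  # this turn's reads and outputs join the context
    return total


# Ten turns; the larger codebase adds 1.5x as many tokens per turn.
small_codebase = total_input_tokens([1_000] * 10, base_prompt=2_000)
large_codebase = total_input_tokens([1_500] * 10, base_prompt=2_000)

print(small_codebase, large_codebase)
print(f"ratio: {large_codebase / small_codebase:.2f}")
```

Under these assumptions the per-turn difference is paid again on every later turn, so total input grows with the number of turns multiplied by the amount read per turn.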
All the feature addition results above use Opus for fix loops (--fix-model opus). Earlier greenfield runs tested Sonnet as the fix model.
Caveat: This data is from early runs that had since-fixed issues (hardcoded test IDs, weak process cleanup). Sonnet's 3-attempt struggle on Hono may have been caused by a test infrastructure bug rather than a language-specific weakness. We haven't re-run this comparison with the fixed harness.
The same bug (overlap boundary >= vs >) hit all three frameworks. The fix model changed the outcome:
| | Sonnet fix cost | Sonnet attempts | Opus fix cost | Opus attempts |
|---|---|---|---|---|
| Brace | $0.46 | 1 | $1.04 | 1 |
| Spring | $0.56 | 1 | $1.10 | 1 |
| Hono | $1.49 | 3 | $0.87 | 1 |
Sonnet fixed the Java frameworks in one attempt at roughly half the cost of Opus. On this particular run, Sonnet struggled with Hono - it regressed by removing a conflict check, then hallucinated a "stale data" diagnosis, needing 3 attempts. Opus fixed the same bug in 1 attempt. Note that Sonnet handles Hono feature development fine (see the model delegation table above, where Hono regularly delegates to Sonnet) - debugging existing code may be a harder task than generating new code, though this is a single data point and may not generalize.
We chose Opus as the fix model for F1–F5 to be conservative. In practice, the fix model barely mattered - most rounds needed zero fix loops across all frameworks. With good specs and docs, first-attempt success is the norm and the fix model is rarely invoked.
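For readers unfamiliar with the overlap boundary bug (`>=` vs `>`) mentioned above, here is a minimal sketch of the two comparisons. It assumes that back-to-back talks in the same room should not count as a conflict; the actual rule is defined in spec.md, which is not reproduced here, and the function name `overlaps` is ours.

```python
from datetime import datetime


def overlaps(start_a, end_a, start_b, end_b, *, inclusive=False):
    """Return True if two time ranges conflict.

    With inclusive=False, ranges are treated as half-open [start, end), so a
    talk ending at 10:00 does not conflict with one starting at 10:00.
    With inclusive=True, touching endpoints are reported as a conflict.
    """
    if inclusive:
        return start_a <= end_b and start_b <= end_a
    return start_a < end_b and start_b < end_a


nine = datetime(2026, 4, 1, 9, 0)
half_past_nine = datetime(2026, 4, 1, 9, 30)
ten = datetime(2026, 4, 1, 10, 0)
eleven = datetime(2026, 4, 1, 11, 0)

# Back-to-back talks: only the inclusive comparison flags a conflict.
back_to_back = (
    overlaps(nine, ten, ten, eleven),
    overlaps(nine, ten, ten, eleven, inclusive=True),
)
# Genuinely overlapping talks: both comparisons agree.
real_overlap = (
    overlaps(nine, ten, half_past_nine, eleven),
    overlaps(nine, ten, half_past_nine, eleven, inclusive=True),
)
print(back_to_back, real_overlap)
```

A one-character difference in the comparison operator changes the result only at the boundary, which is why the bug surfaces in just a couple of tests.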
- Java 21+
- Maven
- Node.js 18+ and npm (for Hono)
- Python 3.10+ with `pip install -r tests/requirements.txt`
- Claude Code CLI
The Brace track depends on com.larvalabs:brace:0.1.6, which is not published to Maven Central. Before running the Brace benchmark, clone larvalabs/brace, check out the matching tag, and install it to your local Maven repository:

    git clone https://github.com/larvalabs/brace.git
    cd brace && git checkout v0.1.6 && mvn install -DskipTests

The template pins v0.1.6 (the latest release at the time this harness was last updated). To run against a newer Brace, bump the version in brace-template/pom.xml and re-check the brace-template/CLAUDE.md against the current API - the request-param methods, in particular, have changed across releases.


    # Full greenfield benchmark (3 runs, all frameworks, parallel)
    ./run.sh
    # Single greenfield run, single framework
    ./run.sh 1 --brace
    # Feature 1 (Speaker Availability) - requires completed greenfield
    ./run-feature.sh 1 --fix-model opus
    # Feature chain (F2-F5) - each builds on previous
    ./run-chain.sh 2 3 4 5 --fix-model opus
    # Single feature, single framework
    ./run-chain.sh 4 --spring --fix-model opus
    # Custom fix model
    ./run-chain.sh 2 3 --brace --fix-model sonnet

- `results/<framework>-run<N>.json` - greenfield metrics
- `results/<framework>-feature-run<N>.json` - F1 metrics
- `results/<framework>-feature<N>-run<M>.json` - F2-F5 metrics
- `work/<framework>-<phase>-run<N>/` - generated code, plans, execution logs


    spec.md                              # Greenfield prompt (identical for all)
    feature-spec.md                      # F1: Speaker Availability
    feature-spec-2-waitlist.md           # F2: Waitlist + Auto-Promotion
    feature-spec-3-ratings.md            # F3: Ratings & Speaker Stats
    feature-spec-4-multiday.md           # F4: Multi-Day Events & Tracks
    feature-spec-5-notifications.md      # F5: Notifications & Activity Feed
    tests/
      test_conference.py                 # 35 base tests
      test_feature_availability.py       # +15 (F1)
      test_feature_waitlist.py           # +13 (F2)
      test_feature_ratings.py            # +17 (F3)
      test_feature_multiday.py           # +21 (F4)
      test_feature_notifications.py      # +16 (F5)
      requirements.txt
    brace-template/                      # Starting point for Brace
    spring-template/                     # Starting point for Spring Boot
    hono-template/                       # Starting point for Hono
    run.sh                               # Greenfield benchmark
    run-feature.sh                       # F1 benchmark
    run-chain.sh                         # F2-F5 chained benchmark
    results/                             # JSON output from each run

- Most comparisons are single-run per feature round - variance is real, though 5 data points per framework provide a trend
- A novel framework (Brace) vs well-known ones (Spring, Hono) - AI has extensive Spring/Hono training data but only CLAUDE.md for Brace. This disadvantages Brace, making the cost savings more notable
- Cost depends on API pricing which changes over time
- The Conference Manager is a mid-complexity CRUD app - results may differ for very different app types (heavy computation, real-time systems, etc.)
- All three frameworks reached 100% test pass rate eventually - the difference is in cost to get there, not capability