larvalabs/ai-benchmark

AI Token Efficiency Benchmark

Measures how many AI tokens (and dollars) it takes to build and iteratively extend the same application in different frameworks, using Claude Code in headless mode with multi-model delegation.

The Question

Does a simpler framework API surface lead to more efficient AI-assisted development? We tested this by having Claude build the same Conference Manager API in three frameworks, then add 5 features incrementally - measuring cost, speed, and correctness at each step.

The Conclusion

The short answer: framework choice matters more as complexity grows. At greenfield scale (6 entities), Brace and Spring Boot are roughly tied (~6% apart, within single-run noise), while Hono's initial build was the most expensive at $2.75. Across the five feature additions (growing to 10 entities, 117 tests), Brace costs about 31% less in cumulative feature cost than Spring Boot, and is cheaper on every round. Documentation quality and test suite quality remain the highest-leverage factors overall. Hono performs comparably to Brace on cumulative feature cost ($5.79 vs $5.62), reinforcing that minimal, explicit frameworks are more token-efficient than layered, convention-heavy ones - regardless of language. The consistency across two independent minimal frameworks in different ecosystems strengthens the case that Spring's higher cost comes from framework design rather than from the language.

Frameworks

| Framework | Language | Style | Why Included |
|---|---|---|---|
| Brace | Java 21 | Minimal, explicit, no DI, no ORM magic | Type-safe, high-performance language with an explicitly AI-optimized framework |
| Spring Boot 3.4 | Java 21 | Full-featured, annotations, Spring Data JPA | Industry-standard Java framework with massive AI training data |
| Hono | TypeScript | Minimal, explicit, raw SQL via better-sqlite3 | Minimal framework in a different type-safe language - tests whether simplicity or language familiarity drives efficiency |

Methodology

Three-Phase Approach

  1. Planning - Claude Opus reads the spec and creates a task breakdown, assigning each task to the cheapest capable model (Haiku for boilerplate, Sonnet for business logic, Opus for architecture)
  2. Execution - Opus executes its plan, delegating tasks to subagents running in parallel where possible
  3. Fix loops - If compilation fails or tests fail, a single model fixes issues directly in the same session (no delegation - fixes need full codebase context). The fix model is configurable (--fix-model).
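The three phases above can be sketched as plain control flow. This is a minimal illustration, not the harness's actual code: the `Task` type, the `kind` labels, the fixed `MODEL_FOR_KIND` table, and the `max_attempts` cap are all assumptions made for the example (in the real runs Opus chooses the delegation itself, and the source does not state a cap on fix attempts).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    kind: str  # hypothetical labels: "boilerplate", "business_logic", "architecture"


# Cheapest-capable-model mapping, following the planning step described above.
MODEL_FOR_KIND: Dict[str, str] = {
    "boilerplate": "haiku",
    "business_logic": "sonnet",
    "architecture": "opus",
}


def assign_models(tasks: List[Task]) -> Dict[str, str]:
    """Planning: map each task name to the model that should execute it."""
    return {task.name: MODEL_FOR_KIND[task.kind] for task in tasks}


def run_fix_loop(
    run_tests: Callable[[], int],   # returns the number of failing tests
    apply_fix: Callable[[], None],  # one fix attempt by the single fix model
    max_attempts: int = 3,          # assumed cap, not taken from the source
) -> int:
    """Fix loop: keep fixing in one session until tests pass or attempts run out."""
    attempts = 0
    while attempts < max_attempts and run_tests() > 0:
        apply_fix()
        attempts += 1
    return attempts
```

The value returned by `run_fix_loop` corresponds to the "Fix loop attempts" figure reported in the results; a run that passes on the first attempt returns 0.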

The App: Conference Manager

A JSON REST API starting with 6 entities, CRUD endpoints, scheduling conflict detection, room capacity enforcement, and schedule query endpoints. Five feature additions progressively grow the app to 10 entities with 117 integration tests. See spec.md for the initial build spec.

Chosen because it has PetClinic-level complexity but isn't in AI training data, and the scheduling logic requires real programming beyond basic CRUD.

Feature Addition Rounds

Each round adds a feature to the existing codebase. The AI must read and understand the current implementation, add new functionality, and ensure all existing tests continue to pass.

| Round | Feature | New Entity | Key Challenge | Cumulative Tests |
|---|---|---|---|---|
| F1 | Speaker Availability | Availability | Modify existing talk validation | 50 |
| F2 | Waitlist + Auto-Promotion | WaitlistEntry | Cross-entity mutation on registration delete | 63 |
| F3 | Ratings & Speaker Stats | Rating | Aggregate queries, modify response shapes | 80 |
| F4 | Multi-Day Events & Tracks | Track | Modify existing entity + validation + schedule format | 101 |
| F5 | Notifications & Activity Feed | Notification | Hook into 4 existing flows (registration, waitlist, talks, ratings) | 117 |
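As a quick cross-check, the cumulative test counts in the table follow from the 35 base tests plus the per-feature test files listed in the repository structure (+15, +13, +17, +21, +16). The variable names below are illustrative only.

```python
BASE_TESTS = 35  # test_conference.py

# Tests added by each feature round, from the tests/ directory listing.
ADDED_TESTS = {"F1": 15, "F2": 13, "F3": 17, "F4": 21, "F5": 16}


def cumulative_tests(base: int, added: dict) -> dict:
    """Running total of tests that must pass after each feature round."""
    totals = {}
    running = base
    for round_name, count in added.items():
        running += count
        totals[round_name] = running
    return totals


if __name__ == "__main__":
    for round_name, total in cumulative_tests(BASE_TESTS, ADDED_TESTS).items():
        print(round_name, total)
```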

Fairness

  • All frameworks receive the identical spec prompt
  • All have a CLAUDE.md with equivalent documentation depth (~200-230 lines each)
  • All use in-memory databases (H2 for Java, SQLite for TypeScript)
  • All start from a minimal template (build config + empty entry point)
  • Same model (Opus) does planning and orchestration for all
  • Same test suite (Python HTTP tests) runs against all
  • Feature rounds build on AI-generated code (not hand-written), so both the starting codebase and the additions are AI-produced under identical conditions

Provenance of the Published Numbers

The results below are a snapshot in time, not a guarantee of bit-for-bit reproduction. They were produced with:

  • Brace 0.1.0-SNAPSHOT - an early pre-release. The current template pins v0.1.6; the Brace API has since changed (notably the request-param methods and the Maven/package namespace, io.brace → com.larvalabs).
  • Claude models claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5 (the model IDs recorded in results/*.json).
  • API pricing as of the run dates (April 2026).

Re-running today against the pinned Brace v0.1.6 and current Claude models will produce different absolute dollar figures - model versions, pricing, and run-to-run variance all move. The value of this benchmark is the relative comparison between frameworks under identical conditions and the methodology, not the exact dollar amounts.

Results: Greenfield Build

| Metric | Brace | Spring Boot | Hono |
|---|---|---|---|
| Tests passed (first attempt) | 33/35 | 33/35 | 33/35 |
| Tests passed (final) | 35/35 | 35/35 | 35/35 |
| Fix loop attempts | 1 | 1 | 1 |
| Total cost | $2.24 | $2.38 | $2.75 |
| Wall clock | 10.1 min | 8.4 min | 9.0 min |
| Lines of code | 562 | 914 | 518 |

All three frameworks produced identical test outcomes at broadly similar cost. The 6% cost difference between Brace and Spring is within noise for a single run; Hono's build was the most expensive at $2.75, about 23% above Brace.

Results: Feature Additions (F1–F5)

All runs used Opus for fix loops (--fix-model opus).

Cost Per Feature

| Round | Feature | Brace | Spring | Hono |
|---|---|---|---|---|
| F1 | Speaker Availability | $1.01 | $1.59 | $1.48 |
| F2 | Waitlist + Auto-Promotion | $1.02 | $1.14 | $0.85 |
| F3 | Ratings & Speaker Stats | $0.75 | $1.18 | $0.91 |
| F4 | Multi-Day Events & Tracks | $1.29 | $1.96 | $1.37 |
| F5 | Notifications & Activity Feed | $1.54 | $2.29 | $1.18 |
| Cumulative (F1-F5) | | $5.62 | $8.16 | $5.79 |

First-Attempt Accuracy

All F1-F5 runs achieved 100% test pass on first attempt (0 fix loops) across all three frameworks, after clarifying an ambiguous edge case in the F1 spec. F2-F5 were clean from the start. See Caveats for notes on run-to-run variance.

Brace vs Spring Savings by Round

| Round | Saving |
|---|---|
| F1 | 36% |
| F2 | 11% |
| F3 | 36% |
| F4 | 34% |
| F5 | 33% |

Brace is cheaper on every round. The saving is fairly stable around a third (F2 is a low outlier), rather than widening monotonically - single-run variance per round is real.
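The per-round savings and the cumulative figures can be reproduced from the published per-feature costs. The sketch below uses the rounded dollar values from the cost table; the helper name `saving_pct` is illustrative. Note that summing Brace's rounded per-round costs gives $5.61 rather than the published $5.62, presumably because the published total was computed from unrounded costs.

```python
# Published cost per feature round, in USD (rounded to cents).
COSTS = {
    "F1": {"brace": 1.01, "spring": 1.59, "hono": 1.48},
    "F2": {"brace": 1.02, "spring": 1.14, "hono": 0.85},
    "F3": {"brace": 0.75, "spring": 1.18, "hono": 0.91},
    "F4": {"brace": 1.29, "spring": 1.96, "hono": 1.37},
    "F5": {"brace": 1.54, "spring": 2.29, "hono": 1.18},
}


def saving_pct(cost: float, baseline: float) -> int:
    """Percentage by which `cost` undercuts `baseline`, rounded to a whole percent."""
    return round(100 * (1 - cost / baseline))


# Brace vs Spring saving for each round.
savings = {r: saving_pct(c["brace"], c["spring"]) for r, c in COSTS.items()}

# Cumulative cost per framework, summed from the rounded per-round values.
totals = {
    fw: round(sum(c[fw] for c in COSTS.values()), 2)
    for fw in ("brace", "spring", "hono")
}

if __name__ == "__main__":
    print(savings)
    print(totals)
    # Cumulative savings using the published totals.
    print(saving_pct(5.62, 8.16), saving_pct(5.79, 8.16))
```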

Codebase Size at Each Stage

| Stage | Brace (LOC / files) | Spring (LOC / files) | Hono (LOC / files) |
|---|---|---|---|
| After F1 | 862 / 16 | 1,180 / 22 | 732 / 9 |
| After F2 | 942 / 18 | 1,362 / 25 | 827 / 10 |
| After F3 | 1,060 / 20 | 1,587 / 28 | 926 / 11 |

Spring's codebase is consistently 1.4–1.5x larger than Brace's due to the repository layer, additional annotations, and service classes. This directly translates to more tokens the AI must read when adding features.
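The size ratio quoted above can be checked directly from the line counts in the table (plus the greenfield counts of 562 and 914 from the earlier results). This is just the arithmetic; the names are illustrative.

```python
# Lines of code per stage, taken from the tables in this document.
LOC = {
    "Greenfield": {"brace": 562, "spring": 914},
    "After F1": {"brace": 862, "spring": 1180},
    "After F2": {"brace": 942, "spring": 1362},
    "After F3": {"brace": 1060, "spring": 1587},
}


def spring_to_brace_ratio(stage: str) -> float:
    """How many times larger the Spring codebase is than the Brace one."""
    counts = LOC[stage]
    return round(counts["spring"] / counts["brace"], 2)


if __name__ == "__main__":
    for stage in LOC:
        print(stage, spring_to_brace_ratio(stage))
```

On these numbers the ratio is highest at greenfield and settles between roughly 1.4 and 1.5 across F1 to F3.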

Model Delegation Patterns

Opus autonomously decides how to delegate tasks to subagents. The model choices reveal what each framework demands:

Output tokens by model per round:

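| Round | Model | Brace | Spring | Hono |
|---|---|---|---|---|
| Greenfield | Opus | 22,334 | 22,210 | 28,034 |
| Greenfield | Sonnet | 5,744 | 9,052 | 7,224 |
| Greenfield | Haiku | 8,860 | 11,040 | 0 |
| F1 | Opus | 9,741 | 14,424 | 16,202 |
| F1 | Sonnet | 1,244 | 1,145 | 1,206 |
| F1 | Haiku | 1,494 | 5,542 | 5,013 |
| F2 | Opus | 9,126 | 9,703 | 6,169 |
| F2 | Sonnet | 1,821 | 5,167 | 2,150 |
| F2 | Haiku | 768 | 3,247 | 3,447 |
| F3 | Opus | 5,161 | 7,785 | 7,217 |
| F3 | Sonnet | 2,314 | 8,121 | 1,841 |
| F3 | Haiku | 1,659 | 1,442 | 4,487 |
| F4 | Opus | 10,243 | 16,102 | 10,650 |
| F4 | Sonnet | 6,668 | 3,937 | 7,491 |
| F4 | Haiku | 291 | 4,664 | 2,229 |
| F5 | Opus | 12,535 | 18,702 | 9,527 |
| F5 | Sonnet | 4,565 | 4,325 | 3,463 |
| F5 | Haiku | 1,754 | 6,672 | 4,154 |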

Key patterns:

  • Opus (the cost driver): Spring requires more Opus tokens than Brace in every feature round (the greenfield builds are level at about 22k each) - by F5 it's using 49% more than Brace (18,702 vs 12,535). The orchestrator has to read and reason about a larger, more layered codebase on every turn.
  • Haiku (cheapest model): In the greenfield build, Brace delegated heavily to Haiku (8,860 tokens) - simple lambda handlers and entity classes were straightforward enough for the cheapest model. Hono used no Haiku at all - Opus kept everything on Opus and Sonnet. In feature additions, Haiku usage drops for Brace and Spring as work shifts from generating boilerplate to modifying existing logic; Hono, which used none in the greenfield build, delegates 2,229–5,013 Haiku tokens per feature round.
  • The real takeaway: The absolute savings from cheaper model delegation are small - Opus orchestration dominates cost regardless. But the relative advantage of a smaller codebase (fewer tokens for the orchestrator to read) persists regardless of pricing. The 1.5x codebase size difference between Spring and Brace is a structural property, not a pricing artefact.
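The Opus comparison in the first bullet can be recomputed from the table above. The sketch below only covers the Opus rows for Brace and Spring; the dictionary layout is illustrative.

```python
# Opus output tokens per round, from the delegation table.
OPUS_TOKENS = {
    "Greenfield": {"brace": 22334, "spring": 22210},
    "F1": {"brace": 9741, "spring": 14424},
    "F2": {"brace": 9126, "spring": 9703},
    "F3": {"brace": 5161, "spring": 7785},
    "F4": {"brace": 10243, "spring": 16102},
    "F5": {"brace": 12535, "spring": 18702},
}

# Spring's Opus output as a multiple of Brace's, per round.
opus_ratio = {
    round_name: round(tokens["spring"] / tokens["brace"], 2)
    for round_name, tokens in OPUS_TOKENS.items()
}

if __name__ == "__main__":
    for round_name, ratio in opus_ratio.items():
        print(f"{round_name}: Spring uses {ratio}x Brace's Opus output tokens")
```

A ratio of 1.49 for F5 is the "49% more" quoted above; the greenfield ratio sits just below 1.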

Runtime Performance

Token efficiency is one axis. Runtime performance is another.

Brace and Spring are both Java, so their runtime performance is comparable. Both use Hibernate for DB access in typical applications, and both run on a modern JVM with virtual threads. Brace's HTTP layer (Jetty 12) is 10-20% faster than Spring's (Tomcat/Undertow) on pure HTTP workloads. For DB-heavy workloads they're similar. (Detailed Brace vs Spring benchmarks)

Hono runs on JavaScript/TypeScript runtimes (Node.js, Deno, or Bun), which are fundamentally slower than Java for sustained server workloads. On TechEmpower Round 23 (identical hardware), Java frameworks handle roughly 3x the throughput of JS frameworks on realistic DB workloads - Spring at 244k req/s vs Express at 78k req/s on the Fortunes test. This is a consistent pattern across every TFB round and test type.

The tradeoff:

| Framework | Token Cost (F1-F5) | Runtime Throughput | Ecosystem |
|---|---|---|---|
| Brace | $5.62 (31% less than Spring) | High (Java) | Java |
| Spring | $8.16 | High (Java) | Java |
| Hono | $5.79 (29% less than Spring) | Moderate (~3x slower than Java) | TypeScript |

For projects that need both token efficiency and runtime performance, Brace is the clear choice. For projects where throughput isn't the bottleneck, Hono offers comparable token savings in the TypeScript ecosystem.

Key Findings

Framework choice matters more as complexity grows

At greenfield scale, Brace and Spring differ by only ~6% in cost - swamped by run-to-run noise. But once features accumulate and the AI must read and modify a growing codebase, a consistent gap opens up: Brace is cheaper on every round (11–36% per round), for a cumulative saving of 31% ($5.62 vs $8.16).

This makes sense mechanically: Spring's layered architecture (Controller → Service → Repository → Entity → DTO) means the AI reads and modifies more files per feature. Brace's flat structure (Controller → Entity) keeps the context smaller. The cost grows with the number of files the AI must load into context on each turn.

Hono is cost-competitive, but trades runtime performance

Hono's cumulative feature cost ($5.79) is comparable to Brace ($5.62). It benefits from both API simplicity AND extensive AI training data. However, Hono/Node.js can't match Java's runtime throughput under high concurrency (see Runtime Performance above). For projects where runtime performance matters, Brace offers most of Hono's token efficiency with Java's performance characteristics.

Documentation quality remains the highest-leverage investment

Our first Brace benchmark run cost $4.45 due to bugs in the CLAUDE.md - wrong parameter syntax, missing dependency, no column naming guidance. After fixing these, the cost dropped to $1.79. Bad docs cost 2.5x more than good docs. This dwarfs any framework design advantage, especially at small scale.

Test suite quality is the second highest-leverage investment

Hardcoded auto-increment IDs in the test suite caused fix loops to waste tokens working around ID mismatches rather than fixing real bugs. Fixing the tests to use returned IDs eliminated these wasted loops across all frameworks.
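The difference between the two test styles can be shown with a small stand-in. This is a sketch only: `FakeApi` is a hypothetical in-memory substitute for the real HTTP API, and the `name` field is made up for the example; the actual suite issues HTTP requests against a running server.

```python
class FakeApi:
    """Hypothetical in-memory stand-in for the REST API under test."""

    def __init__(self, first_id: int = 1):
        # A non-1 starting id mimics leftover rows or a different id strategy.
        self._next_id = first_id
        self._rows = {}

    def post(self, body: dict) -> dict:
        """Create a row and return it, including the server-assigned id."""
        row = {"id": self._next_id, **body}
        self._rows[row["id"]] = row
        self._next_id += 1
        return row

    def get(self, row_id: int):
        """Return the row, or None if it does not exist (a 404 in the real API)."""
        return self._rows.get(row_id)


def brittle_check(api: FakeApi) -> bool:
    """Hardcodes id 1: passes only if the first created row happens to get id 1."""
    api.post({"name": "Ada"})
    return api.get(1) is not None


def robust_check(api: FakeApi) -> bool:
    """Uses the id returned by the create call, so id assignment does not matter."""
    created = api.post({"name": "Ada"})
    return api.get(created["id"]) is not None
```

With `FakeApi(first_id=7)` the brittle check fails even though the API behaves correctly, which is the kind of false failure that sends a fix loop chasing ID mismatches instead of real bugs.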

Run-to-run variance is real but diminishing

Single-run comparisons are noisy - the same ambiguous spec interpretation can cascade into 10-15 test failures. However, the variance matters less as the dataset grows. With 5 feature rounds, the per-round noise averages out and the trend is clear: Brace costs less than Spring in every round, by roughly a third in four of the five.

Cache reads dominate cost

The biggest cost driver isn't output tokens - it's the orchestrator re-reading the growing conversation on each turn. This is why codebase size matters: a larger codebase means more tokens loaded into context on every read, which compounds across the planning, execution, and fix phases.
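A toy model makes the compounding visible. It assumes, purely for illustration, that every orchestrator turn re-reads the whole codebase plus the conversation so far, and that the conversation grows by a fixed number of tokens per turn. The token counts in the usage lines are made-up round numbers, not measurements from the benchmark.

```python
def total_context_tokens(codebase_tokens: int, turns: int, tokens_added_per_turn: int) -> int:
    """Total tokens read across all turns under the simplified re-read model."""
    total = 0
    conversation = 0
    for _ in range(turns):
        # Each turn reads the codebase plus everything said so far.
        total += codebase_tokens + conversation
        conversation += tokens_added_per_turn
    return total


if __name__ == "__main__":
    # Hypothetical sizes: the second codebase is 1.5x the first.
    small = total_context_tokens(codebase_tokens=10_000, turns=20, tokens_added_per_turn=500)
    large = total_context_tokens(codebase_tokens=15_000, turns=20, tokens_added_per_turn=500)
    print(small, large, round(large / small, 2))
```

In this model the codebase term is paid once per turn, so a larger codebase raises the total in proportion to the number of turns, on top of the conversation growth that both frameworks share.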

Fix Loop Model Selection

All the feature addition results above use Opus for fix loops (--fix-model opus). Earlier greenfield runs tested Sonnet as the fix model.

Greenfield: Sonnet vs Opus for fixes

Caveat: This data is from early runs that had since-fixed issues (hardcoded test IDs, weak process cleanup). Sonnet's 3-attempt struggle on Hono may have been caused by a test infrastructure bug rather than a language-specific weakness. We haven't re-run this comparison with the fixed harness.

The same bug (overlap boundary >= vs >) hit all three frameworks. The fix model changed the outcome:

| Framework | Sonnet fix cost | Sonnet attempts | Opus fix cost | Opus attempts |
|---|---|---|---|---|
| Brace | $0.46 | 1 | $1.04 | 1 |
| Spring | $0.56 | 1 | $1.10 | 1 |
| Hono | $1.49 | 3 | $0.87 | 1 |

Sonnet fixed the Java frameworks in one attempt at roughly half the cost of Opus. On this particular run, Sonnet struggled with Hono - it regressed by removing a conflict check, then hallucinated a "stale data" diagnosis, needing 3 attempts. Opus fixed the same bug in 1 attempt. Note that Sonnet handles Hono feature development fine (see the model delegation table above, where Hono regularly delegates to Sonnet) - debugging existing code may be a harder task than generating new code, though this is a single data point and may not generalize.
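For readers unfamiliar with this class of bug, here is a minimal sketch of an overlap check where the choice between a strict and a non-strict comparison decides the boundary case. The source only names the bug as ">= vs >" and does not say which behavior the spec requires, so the version below assumes that back-to-back talks (one ending exactly when the next starts) are not a conflict. The function name and signature are illustrative.

```python
from datetime import datetime


def talks_overlap(start_a: datetime, end_a: datetime,
                  start_b: datetime, end_b: datetime) -> bool:
    """Treat each talk as a half-open interval [start, end).

    Strict comparisons mean that talks which merely touch at an endpoint
    do not count as overlapping (an assumption, see the note above).
    Swapping < for <= would flag back-to-back talks as conflicts.
    """
    return start_a < end_b and start_b < end_a


if __name__ == "__main__":
    nine = datetime(2026, 4, 1, 9, 0)
    ten = datetime(2026, 4, 1, 10, 0)
    eleven = datetime(2026, 4, 1, 11, 0)
    print(talks_overlap(nine, ten, ten, eleven))     # back-to-back talks
    print(talks_overlap(nine, eleven, ten, eleven))  # second talk inside the first
```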

Why we used Opus for feature additions

We chose Opus as the fix model for F1–F5 to be conservative. In practice, the fix model barely mattered - most rounds needed zero fix loops across all frameworks. With good specs and docs, first-attempt success is the norm and the fix model is rarely invoked.

Running the Benchmark

Prerequisites

  • Java 21+
  • Maven
  • Node.js 18+ and npm (for Hono)
  • Python 3.10+ with pip install -r tests/requirements.txt
  • Claude Code CLI

The Brace track depends on com.larvalabs:brace:0.1.6, which is not published to Maven Central. Before running the Brace benchmark, clone larvalabs/brace, check out the matching tag, and install it to your local Maven repository:

git clone https://github.com/larvalabs/brace.git
cd brace && git checkout v0.1.6 && mvn install -DskipTests

The template pins v0.1.6 (the latest release at the time this harness was last updated). To run against a newer Brace, bump the version in brace-template/pom.xml and re-check the brace-template/CLAUDE.md against the current API - the request-param methods, in particular, have changed across releases.

Commands

# Full greenfield benchmark (3 runs, all frameworks, parallel)
./run.sh

# Single greenfield run, single framework
./run.sh 1 --brace

# Feature 1 (Speaker Availability) - requires completed greenfield
./run-feature.sh 1 --fix-model opus

# Feature chain (F2-F5) - each builds on previous
./run-chain.sh 2 3 4 5 --fix-model opus

# Single feature, single framework
./run-chain.sh 4 --spring --fix-model opus

# Custom fix model
./run-chain.sh 2 3 --brace --fix-model sonnet

Output

  • results/<framework>-run<N>.json - greenfield metrics
  • results/<framework>-feature-run<N>.json - F1 metrics
  • results/<framework>-feature<N>-run<M>.json - F2-F5 metrics
  • work/<framework>-<phase>-run<N>/ - generated code, plans, execution logs
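When collecting results, the three file-name patterns above can be told apart with one regular expression. This sketch only parses file names; it makes no assumption about the JSON contents, whose schema is not described here. The function name and the "greenfield"/"F1"/"F<N>" phase labels are illustrative.

```python
import re
from pathlib import Path
from typing import Tuple

# <framework>-run<N>.json            -> greenfield
# <framework>-feature-run<N>.json    -> F1
# <framework>-feature<N>-run<M>.json -> F2-F5
_RESULT_NAME = re.compile(
    r"^(?P<framework>[a-z]+)-(?:feature(?P<feature>\d*)-)?run(?P<run>\d+)\.json$"
)


def parse_result_name(path: str) -> Tuple[str, str, int]:
    """Return (framework, phase, run number) for a results file path."""
    match = _RESULT_NAME.match(Path(path).name)
    if match is None:
        raise ValueError(f"unrecognized results file name: {path}")
    feature = match.group("feature")
    if feature is None:
        phase = "greenfield"
    elif feature == "":
        phase = "F1"  # the F1 file name carries no feature number
    else:
        phase = f"F{feature}"
    return match.group("framework"), phase, int(match.group("run"))


if __name__ == "__main__":
    for name in ("results/brace-run1.json",
                 "results/spring-feature-run2.json",
                 "results/hono-feature4-run1.json"):
        print(parse_result_name(name))
```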

Structure

spec.md                         # Greenfield prompt (identical for all)
feature-spec.md                 # F1: Speaker Availability
feature-spec-2-waitlist.md      # F2: Waitlist + Auto-Promotion
feature-spec-3-ratings.md       # F3: Ratings & Speaker Stats
feature-spec-4-multiday.md      # F4: Multi-Day Events & Tracks
feature-spec-5-notifications.md # F5: Notifications & Activity Feed
tests/
  test_conference.py            # 35 base tests
  test_feature_availability.py  # +15 (F1)
  test_feature_waitlist.py      # +13 (F2)
  test_feature_ratings.py       # +17 (F3)
  test_feature_multiday.py      # +21 (F4)
  test_feature_notifications.py # +16 (F5)
  requirements.txt
brace-template/                 # Starting point for Brace
spring-template/                # Starting point for Spring Boot
hono-template/                  # Starting point for Hono
run.sh                          # Greenfield benchmark
run-feature.sh                  # F1 benchmark
run-chain.sh                    # F2-F5 chained benchmark
results/                        # JSON output from each run

Caveats

  • Most comparisons are single-run per feature round - variance is real, though 5 data points per framework provide a trend
  • A novel framework (Brace) vs well-known ones (Spring, Hono) - AI has extensive Spring/Hono training data but only CLAUDE.md for Brace. This disadvantages Brace, making the cost savings more notable
  • Cost depends on API pricing which changes over time
  • The Conference Manager is a mid-complexity CRUD app - results may differ for very different app types (heavy computation, real-time systems, etc.)
  • All three frameworks reached 100% test pass rate eventually - the difference is in cost to get there, not capability
