Arena
/arena runs multiple models against each other on the same prompt. Three modes. UI lives in ui/multi_arena.py.
Comparing system prompts against each other on the same model is a different feature — see Prompt-Arena.
┌──────────────────────────────────────────────────────────────────────────────┐
│ │
│ ▌ QUICK COMPARE │
│ Single prompt → all models stream in parallel → live grid display. │
│ │
│ ▌ BATTLE │
│ Multi-round manual evaluation. After each round you vote on the best │
│ response; running scoreboard tracks wins. Blind mode shuffles model │
│ identities and reveals them at the end. │
│ │
│ ▌ TOURNAMENT │
│ Automated evaluation with a judge model of your choice. Pick from 5 │
│ built-in suites or supply custom prompts. Suite-specific judging. │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Up to 6 models side-by-side. All modes support blind eval, system prompts, TTFT tracking, and markdown export.
Quick Compare
One prompt, all selected models stream in parallel into a grid. Each cell shows the response, decode TPS, TTFT, and total tokens as the stream lands.
┌─ qwen3-9b ──────────┬─ llama-3.3-70b ──────┬─ phi-4 ──────────────┐
│ Paris is the… │ Paris is the capital │ The capital of │
│ │ of France, located… │ France is Paris… │
│ │ │ │
│ 32 tok · 0.7s │ 41 tok · 1.2s │ 28 tok · 0.5s │
│ 45.7 t/s · 142ms │ 34.2 t/s · 380ms │ 56.0 t/s · 95ms │
└─────────────────────┴──────────────────────┴──────────────────────┘
Best for: quick spot-checks, comparing latency / streaming feel, sniff-testing a new model.
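The per-cell numbers can be derived from chunk arrival times. The sketch below is a minimal illustration, not the code in ui/multi_arena.py: `StreamStats` and `stream_stats` are hypothetical names, and it assumes TTFT is measured from request start to the first non-empty chunk and decode TPS is total tokens divided by the time between the first and last chunk.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class StreamStats:
    tokens: int        # total tokens received
    ttft_ms: float     # time to first token, in milliseconds
    decode_tps: float  # tokens per second once the first token has arrived


def stream_stats(events: Iterable[Tuple[float, int]], started_at: float) -> StreamStats:
    """Summarize one model's stream.

    `events` yields (timestamp_seconds, tokens_in_chunk) pairs in arrival order.
    `started_at` is the timestamp at which the request was sent.
    """
    first = last = None
    total = 0
    for ts, n in events:
        if n <= 0:
            continue  # ignore keep-alive / empty chunks
        if first is None:
            first = ts
        last = ts
        total += n
    if first is None:
        return StreamStats(0, 0.0, 0.0)  # nothing was streamed
    decode_seconds = last - first
    tps = total / decode_seconds if decode_seconds > 0 else 0.0
    return StreamStats(total, (first - started_at) * 1000.0, tps)
```

In a live grid the same function could be called on each cell's event list every time a chunk lands, so the footer updates as the stream progresses.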
Battle
Multi-round, human-judged. After each round you press 1–N to pick the best response. A running scoreboard shows wins per model.
Round 10 of 10
──────────────────────────────────────────────────────────
Score so far:
model-A ███████░░░░░░░░░░ 4
model-B █████░░░░░░░░░░░░ 3
model-C ████░░░░░░░░░░░░░ 2
model-D ░░░░░░░░░░░░░░░░░ 0
──────────────────────────────────────────────────────────
In blind battle, model identities are replaced with A / B / C during the round. Names are revealed only at the end. This removes bias from logos, vendors, and sizes: you'll be surprised how often the "small" model wins.
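The blind-label idea can be sketched in a few lines. This is an illustration under assumptions rather than the arena's actual implementation: `assign_blind_labels` and `reveal` are hypothetical helpers, and votes are assumed to be recorded as the label the user picked each round.

```python
import random
import string
from collections import Counter


def assign_blind_labels(models, rng=None):
    """Shuffle the models and map them to anonymous labels A, B, C, ..."""
    rng = rng or random.Random()
    shuffled = list(models)
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_uppercase, shuffled))


def reveal(votes, labels):
    """Translate per-round label votes into wins per real model name.

    Returns (model, wins) pairs sorted by wins, highest first. Models that
    never won are included with a count of 0.
    """
    wins = Counter({model: 0 for model in labels.values()})
    for label in votes:
        wins[labels[label]] += 1
    return wins.most_common()
```

During the rounds only the keys of the mapping are shown; calling `reveal` at the end is the single place where real names reappear.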
Tournament
Fully automated. You pick a judge model and a suite, then every prompt in the suite is run against every selected model and judged.
| Suite | What's in it | Judging criteria |
|---|---|---|
| Reasoning | Logic puzzles, math word problems, deduction chains | Correctness |
| Coding | Bug fixes, refactors, algorithm questions | Correctness + edge cases |
| Creative | Open-ended writing, brainstorming | Variety + coherence |
| Instruction | Multi-constraint requests with hidden rules | Constraint compliance |
| Analysis | Document summarization, fact extraction | Accuracy + completeness |
You can also supply a list of custom prompts.
The critical thing: coding prompts are judged on correctness, not 'creativity' or 'tone'. The judge model is given suite-specific criteria as part of its system prompt:
[Coding suite judge prompt — paraphrased]
You are evaluating two code responses. Judge ONLY on:
1. Does the code correctly solve the stated problem?
2. Are edge cases handled?
3. Are there subtle bugs?
Do NOT reward 'effort', 'verbosity', or 'creative approaches'.
Pick A, B, or TIE.
This is why tournament results look different from quick-compare gut feels — the judge isn't picking the response it likes, it's picking the one that's correct.
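A rough sketch of how suite-specific criteria might be injected into the judge's system prompt and how an A / B / TIE verdict could be parsed. The criteria strings are copied from the suite table; the function names, the prompt wording, and the "verdict on the last line" convention are assumptions made for illustration, not the real judge prompt.

```python
import re
from typing import Optional

# Judging criteria per suite, taken from the table above.
SUITE_CRITERIA = {
    "reasoning": "Correctness",
    "coding": "Correctness + edge cases",
    "creative": "Variety + coherence",
    "instruction": "Constraint compliance",
    "analysis": "Accuracy + completeness",
}

# Assumed convention: the judge ends its reply with a line such as "Verdict: A".
_VERDICT_LINE = re.compile(r"(?:verdict\s*[:\-]\s*)?(A|B|TIE)\W*", re.IGNORECASE)


def build_judge_system_prompt(suite: str) -> str:
    """Build a system prompt that restricts the judge to one suite's criteria."""
    criteria = SUITE_CRITERIA[suite.lower()]
    return (
        f"You are evaluating two responses from the {suite} suite.\n"
        f"Judge ONLY on: {criteria}.\n"
        "Pick A, B, or TIE."
    )


def parse_verdict(judge_reply: str) -> Optional[str]:
    """Return 'A', 'B', or 'TIE' from the last non-empty line, else None."""
    lines = [line.strip() for line in judge_reply.splitlines() if line.strip()]
    if not lines:
        return None
    match = _VERDICT_LINE.fullmatch(lines[-1])
    return match.group(1).upper() if match else None
```

Returning `None` for an unparseable reply leaves it to the caller to decide whether to retry the judge or record the comparison as a tie.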
Final output:
Tournament · Coding suite · judge=qwen3-32b · 5 models × 12 prompts
────────────────────────────────────────────────────────────────────────
Rank Model Wins WR Avg TPS Avg TTFT
────────────────────────────────────────────────────────────────────────
1 llama-3.3-70b-instruct 39 65.0% 31.4 t/s 410 ms
2 qwen3.6-27b 32 53.3% 44.1 t/s 280 ms
3 phi-4 28 46.7% 56.0 t/s 140 ms
4 mistral-large-2407 24 40.0% 22.8 t/s 580 ms
5 gemma-2-27b 17 28.3% 28.3 t/s 220 ms
────────────────────────────────────────────────────────────────────────
WR = win rate. Avg TPS is decode-only tokens per second and Avg TTFT is time to first token (same definitions as Chat#performance-metrics).
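The leaderboard columns can be reproduced from per-model tallies. The sketch below assumes win rate is wins divided by the number of judged comparisons a model took part in, and that the averages are plain means over per-response samples; `ModelResult` and `leaderboard` are hypothetical names.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class ModelResult:
    model: str
    wins: int
    games: int  # judged comparisons this model took part in (assumed denominator)
    tps_samples: List[float] = field(default_factory=list)
    ttft_ms_samples: List[float] = field(default_factory=list)


def leaderboard(results: List[ModelResult]) -> List[dict]:
    """Rank models by wins and attach win rate plus average speed metrics."""
    rows = []
    for r in results:
        rows.append({
            "model": r.model,
            "wins": r.wins,
            "win_rate": 100.0 * r.wins / r.games if r.games else 0.0,
            "avg_tps": mean(r.tps_samples) if r.tps_samples else 0.0,
            "avg_ttft_ms": mean(r.ttft_ms_samples) if r.ttft_ms_samples else 0.0,
        })
    rows.sort(key=lambda row: row["wins"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows
```

With 39 wins out of 60 comparisons this gives a 65.0% win rate, matching the first row of the example table.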
Discovery returns a flat list of (server, model) pairs, so arena can mix models across servers — useful for comparing your local Ollama against a remote vLLM box on the same prompt.
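As an illustration of the flat (server, model) shape, the sketch below flattens a per-server listing and picks contestants across servers. The input dictionary, the server names, and both function names are hypothetical; only the pair shape and the 6-model limit come from this page.

```python
from typing import Dict, Iterable, List, Tuple

MAX_ARENA_MODELS = 6  # the arena shows up to 6 models side-by-side

ServerModel = Tuple[str, str]


def flatten_discovery(servers: Dict[str, List[str]]) -> List[ServerModel]:
    """Turn {server: [models]} into a flat list of (server, model) pairs."""
    return [(server, model) for server, models in servers.items() for model in models]


def pick_contestants(pairs: Iterable[ServerModel], wanted: Iterable[str]) -> List[ServerModel]:
    """Select pairs by model name, regardless of which server hosts them."""
    wanted_set = set(wanted)
    chosen = [pair for pair in pairs if pair[1] in wanted_set]
    if len(chosen) > MAX_ARENA_MODELS:
        raise ValueError(f"arena supports at most {MAX_ARENA_MODELS} models")
    return chosen
```

Because each contestant carries its own server, a local Ollama model and a remote vLLM model can sit in the same list and receive the same prompt.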
Every arena run dumps a full markdown transcript: every prompt, every response, every judge rationale, and the final leaderboard. File names look like arena_20260403_071611.md. These transcripts are the artifact you point to when someone asks "why did you pick model X".
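A minimal sketch of what such an export could look like. It assumes the file name is a `YYYYMMDD_HHMMSS` timestamp (inferred from the example name, not confirmed) and uses an invented transcript layout; `transcript_filename` and `render_transcript` are hypothetical helpers.

```python
from datetime import datetime
from typing import Dict, List, Tuple


def transcript_filename(now: datetime) -> str:
    """Build a name like arena_20260403_071611.md (timestamp format is assumed)."""
    return now.strftime("arena_%Y%m%d_%H%M%S.md")


def render_transcript(
    prompt: str,
    responses: Dict[str, str],
    leaderboard_rows: List[Tuple[str, int]],
) -> str:
    """Render one prompt, each model's response, and a small leaderboard table."""
    lines = ["# Arena transcript", "", "## Prompt", "", prompt, ""]
    for model, text in responses.items():
        lines += [f"## {model}", "", text, ""]
    lines += ["## Leaderboard", "", "| Rank | Model | Wins |", "|---|---|---|"]
    for rank, (model, wins) in enumerate(leaderboard_rows, start=1):
        lines.append(f"| {rank} | {model} | {wins} |")
    return "\n".join(lines) + "\n"
```

Writing the result is then a single `open(transcript_filename(datetime.now()), "w").write(...)` call in whatever directory the run uses.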