mtecnic edited this page May 28, 2026 · 1 revision

Arena

/arena runs multiple models against each other on the same prompt, in one of three modes: Quick Compare, Battle, or Tournament. The UI lives in ui/multi_arena.py.

Comparing system prompts against each other on the same model is a different feature — see Prompt-Arena.


Modes

┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│   ▌  QUICK COMPARE                                                           │
│   Single prompt → all models stream in parallel → live grid display.         │
│                                                                              │
│   ▌  BATTLE                                                                  │
│   Multi-round manual evaluation. After each round you vote on the best       │
│   response; running scoreboard tracks wins. Blind mode shuffles model        │
│   identities and reveals them at the end.                                    │
│                                                                              │
│   ▌  TOURNAMENT                                                              │
│   Automated evaluation with a judge model of your choice. Pick from 5        │
│   built-in suites or supply custom prompts. Suite-specific judging.          │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Up to 6 models side-by-side. All modes support blind eval, system prompts, TTFT tracking, and markdown export.


Quick Compare

One prompt, all selected models stream in parallel into a grid. Each cell shows the response, decode TPS, TTFT, and total tokens as the stream lands.

┌─ qwen3-9b ──────────┬─ llama-3.3-70b ──────┬─ phi-4 ──────────────┐
│ Paris is the…       │ Paris is the capital │ The capital of       │
│                     │ of France, located…  │ France is Paris…     │
│                     │                      │                      │
│ 32 tok · 0.7s       │ 41 tok · 1.2s        │ 28 tok · 0.5s        │
│ 45.7 t/s · 142ms    │ 34.2 t/s · 380ms     │ 56.0 t/s · 95ms      │
└─────────────────────┴──────────────────────┴──────────────────────┘

Best for: quick spot-checks, comparing latency / streaming feel, sniff-testing a new model.
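
The per-cell numbers can be derived from token arrival times alone. The following is a minimal sketch, not the code in ui/multi_arena.py: `stream_metrics` and `StreamMetrics` are hypothetical names, and it assumes "decode TPS" means tokens after the first one divided by the time elapsed since the first token arrived.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_ms: float      # time to first token, in milliseconds
    decode_tps: float   # tokens per second after the first token
    total_tokens: int


def stream_metrics(start: float, arrivals: Sequence[float]) -> StreamMetrics:
    """Compute metrics from a request start time and token arrival times (seconds)."""
    if not arrivals:
        return StreamMetrics(ttft_ms=0.0, decode_tps=0.0, total_tokens=0)
    ttft = arrivals[0] - start
    decode_time = arrivals[-1] - arrivals[0]
    # Assumption: the first token is attributed to prefill, so it is excluded here.
    decode_tokens = len(arrivals) - 1
    tps = decode_tokens / decode_time if decode_time > 0 else 0.0
    return StreamMetrics(
        ttft_ms=ttft * 1000.0,
        decode_tps=tps,
        total_tokens=len(arrivals),
    )
```

Keeping TTFT out of the throughput figure means a model with a slow first token is not penalised twice, which is presumably why the two numbers are reported separately.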


Battle

Multi-round, human-judged. After each round you press 1–N to pick the best response. A running scoreboard shows wins per model.

  Round 3 of 10
  ──────────────────────────────────────────────────────────
   Score so far:
     model-A    ███████░░░░░░░░░░  4
     model-B    █████░░░░░░░░░░░░  3
     model-C    ████░░░░░░░░░░░░░  2
     model-D    ░░░░░░░░░░░░░░░░░  0
  ──────────────────────────────────────────────────────────

Blind mode

In blind battle, model identities are replaced with A / B / C during the round. Names reveal only at the end. This removes bias from logos / vendors / sizes — you'll be surprised how often the "small" model wins.
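
One way to implement this kind of blinding is to shuffle the model list once and hand out letters in the shuffled order. This is a sketch under assumptions: `assign_blind_labels` and `reveal` are hypothetical helpers, and the page does not say whether the tool reshuffles between rounds (here it does not).

```python
import random
import string
from typing import Dict, List, Optional


def assign_blind_labels(models: List[str], seed: Optional[int] = None) -> Dict[str, str]:
    """Return a mapping from blind label ("A", "B", ...) to the real model name."""
    rng = random.Random(seed)
    shuffled = list(models)  # copy so the caller's order is untouched
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_uppercase, shuffled))


def reveal(votes: List[str], labels: Dict[str, str]) -> Dict[str, int]:
    """Translate per-round label votes into wins per real model name."""
    wins = {name: 0 for name in labels.values()}
    for label in votes:
        wins[labels[label]] += 1
    return wins


labels = assign_blind_labels(["qwen3-9b", "llama-3.3-70b", "phi-4"], seed=7)
print(reveal(["A", "C", "A"], labels))  # real names appear only at this point
```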


Tournament

Fully automated. You pick a judge model and a suite, then every prompt in the suite is run against every selected model and judged.
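
The page does not say how responses are paired for judging. Since the judge prompt quoted further down picks A, B, or TIE, one plausible reading is a round-robin of head-to-head matches per prompt. The sketch below assumes exactly that; `pairwise_matches` is a hypothetical name.

```python
from itertools import combinations
from typing import Iterator, List, Tuple

Match = Tuple[str, str, str]  # (prompt, model_a, model_b)


def pairwise_matches(models: List[str], prompts: List[str]) -> Iterator[Match]:
    """Yield one head-to-head match per prompt and unordered model pair."""
    for prompt in prompts:
        for model_a, model_b in combinations(models, 2):
            yield (prompt, model_a, model_b)


matches = list(pairwise_matches(["m1", "m2", "m3"], ["p1", "p2"]))
print(len(matches))  # 3 pairs x 2 prompts
```

Under this assumption the number of judge calls grows quadratically with the number of models, so the judge model's speed matters more than any single contestant's.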

Built-in suites

  Suite        What's in it                                         Judging criteria
  ──────────────────────────────────────────────────────────────────────────────────────────
  Reasoning    Logic puzzles, math word problems, deduction chains  Correctness
  Coding       Bug fixes, refactors, algorithm questions            Correctness + edge cases
  Creative     Open-ended writing, brainstorming                    Variety + coherence
  Instruction  Multi-constraint requests with hidden rules          Constraint compliance
  Analysis     Document summarization, fact extraction              Accuracy + completeness
  ──────────────────────────────────────────────────────────────────────────────────────────

You can also supply a list of custom prompts.

Tournament judging

The critical thing: coding prompts are judged on correctness, not 'creativity' or 'tone'. The judge model is given suite-specific criteria as part of its system prompt:

[Coding suite judge prompt — paraphrased]
You are evaluating two code responses. Judge ONLY on:
1. Does the code correctly solve the stated problem?
2. Are edge cases handled?
3. Are there subtle bugs?
Do NOT reward 'effort', 'verbosity', or 'creative approaches'.
Pick A, B, or TIE.

This is why tournament results can look different from quick-compare gut feels: the judge is told to pick the response that is correct, not the one it likes.
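
A sketch of how suite-specific criteria could be wired into a judge prompt and how the verdict could be read back. The criteria strings come from the built-in suites table; everything else is an assumption, including the prompt wording, the names `build_judge_prompt` and `parse_verdict`, the convention that the verdict sits on the last line, and the fallback to TIE.

```python
from typing import Dict

# Criteria copied from the built-in suites table. The wording of the real
# judge prompts is only paraphrased on this page.
SUITE_CRITERIA: Dict[str, str] = {
    "reasoning": "Correctness",
    "coding": "Correctness + edge cases",
    "creative": "Variety + coherence",
    "instruction": "Constraint compliance",
    "analysis": "Accuracy + completeness",
}


def build_judge_prompt(suite: str) -> str:
    """Build a hypothetical judge system prompt for one suite."""
    criteria = SUITE_CRITERIA[suite.lower()]
    return (
        "You are evaluating two responses. "
        f"Judge ONLY on: {criteria}. "
        "End your answer with a single line containing A, B, or TIE."
    )


def parse_verdict(judge_output: str) -> str:
    """Read the verdict from the last non-empty line; unreadable output counts as TIE."""
    lines = [line.strip() for line in judge_output.splitlines() if line.strip()]
    if not lines:
        return "TIE"
    last = lines[-1].upper().rstrip(".")
    return last if last in {"A", "B", "TIE"} else "TIE"
```

Reading only the last line avoids mistaking the article "a" in the rationale for a vote, at the cost of treating any off-format answer as a tie.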

Leaderboard

Final output:

  Tournament · Coding suite · judge=qwen3-32b · 5 models × 12 prompts
  ────────────────────────────────────────────────────────────────────────
   Rank  Model                  Wins  WR     Avg TPS   Avg TTFT
  ────────────────────────────────────────────────────────────────────────
   1     llama-3.3-70b-instruct  39   65.0%  31.4 t/s  410 ms
   2     qwen3.6-27b             32   53.3%  44.1 t/s  280 ms
   3     phi-4                   28   46.7%  56.0 t/s  140 ms
   4     mistral-large-2407      24   40.0%  22.8 t/s  580 ms
   5     gemma-2-27b             17   28.3%  28.3 t/s  220 ms
  ────────────────────────────────────────────────────────────────────────

WR = win rate. Avg TPS is decode-only throughput and Avg TTFT is time to first token (same definitions as Chat#performance-metrics).
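
The page does not spell out the WR denominator. The sample rows are consistent with wins divided by 60 (for example 39/60 = 65.0%), so this sketch takes the number of judged comparisons per model as an explicit argument instead of guessing how it is derived. `leaderboard` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str, int, float]  # (rank, model, wins, win rate in percent)


def leaderboard(wins: Dict[str, int], judged: int) -> List[Row]:
    """Rank models by wins and attach a win rate rounded to one decimal."""
    if judged <= 0:
        raise ValueError("judged must be positive")
    ordered = sorted(wins.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, model, w, round(100.0 * w / judged, 1))
        for rank, (model, w) in enumerate(ordered, start=1)
    ]


for row in leaderboard({"phi-4": 28, "llama-3.3-70b-instruct": 39}, judged=60):
    print(row)
```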


Picking models from multiple servers

Discovery returns a flat list of (server, model) pairs, so arena can mix models across servers — useful for comparing your local Ollama against a remote vLLM box on the same prompt.
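
A flat list of pairs is straightforward to build from a per-server model listing. In this sketch the input shape (a dict of server name to model names), the function name `flatten_discovery`, and the server names are all assumptions for illustration; only the (server, model) output shape comes from the text above.

```python
from typing import Dict, List, Tuple


def flatten_discovery(servers: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Turn {server: [models]} into a flat list of (server, model) pairs."""
    return [(server, model) for server, models in servers.items() for model in models]


pairs = flatten_discovery({
    "ollama-local": ["qwen3-9b", "phi-4"],   # hypothetical server names
    "vllm-remote": ["llama-3.3-70b"],
})
print(pairs)
```

Because each entry carries its server, two servers can expose a model with the same name without colliding in the selection list.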


Export

Every arena run dumps a full markdown transcript: every prompt, every response, every judge rationale, and the final leaderboard. File names look like arena_20260403_071611.md. The transcript is the artifact you point to when someone asks "why did you pick model X".
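
The sample file name reads like a date and time stamp. Assuming the pattern is arena_YYYYMMDD_HHMMSS.md in local time (the page shows only the one example), a name of that shape can be produced as follows; `arena_export_name` is a hypothetical helper.

```python
from datetime import datetime
from typing import Optional


def arena_export_name(now: Optional[datetime] = None) -> str:
    """Return a transcript file name such as arena_20260403_071611.md."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"arena_{stamp}.md"


print(arena_export_name(datetime(2026, 4, 3, 7, 16, 11)))
```

Names of this shape sort chronologically in a plain directory listing.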
