Arena

/arena runs multiple models against each other on the same prompt. Three modes. UI lives in ui/multi_arena.py.

Comparing system prompts against each other on the same model is a different feature — see Prompt-Arena.

Modes

┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│   ▌  QUICK COMPARE                                                           │
│   Single prompt → all models stream in parallel → live grid display.         │
│                                                                              │
│   ▌  BATTLE                                                                  │
│   Multi-round manual evaluation. After each round you vote on the best       │
│   response; running scoreboard tracks wins. Blind mode shuffles model        │
│   identities and reveals them at the end.                                    │
│                                                                              │
│   ▌  TOURNAMENT                                                              │
│   Automated evaluation with a judge model of your choice. Pick from 5        │
│   built-in suites or supply custom prompts. Suite-specific judging.          │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Up to 6 models side-by-side. All modes support blind eval, system prompts, TTFT tracking, and markdown export.

Quick Compare

One prompt, all selected models stream in parallel into a grid. Each cell shows the response, decode TPS, TTFT, and total tokens as the stream lands.

┌─ qwen3-9b ──────────┬─ llama-3.3-70b ──────┬─ phi-4 ──────────────┐
│ Paris is the…       │ Paris is the capital │ The capital of       │
│                     │ of France, located…  │ France is Paris…     │
│                     │                      │                      │
│ 32 tok · 0.7s       │ 41 tok · 1.2s        │ 28 tok · 0.5s        │
│ 45.7 t/s · 142ms    │ 34.2 t/s · 380ms     │ 56.0 t/s · 95ms      │
└─────────────────────┴──────────────────────┴──────────────────────┘

Best for: quick spot-checks, comparing latency / streaming feel, sniff-testing a new model.
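The per-cell numbers in the grid can be derived from token arrival times. The following is a minimal sketch, not the code in ui/multi_arena.py: `StreamStats` and `stream_stats` are hypothetical names, and it assumes decode TPS counts only the tokens that arrive after the first one, so that the wait for the first token (TTFT) is excluded.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    ttft_ms: float      # time from request start to first token, in milliseconds
    decode_tps: float   # tokens per second, measured after the first token
    total_tokens: int


def stream_stats(request_start: float, token_times: list[float]) -> StreamStats:
    """Compute per-cell metrics from token arrival timestamps (in seconds)."""
    if not token_times:
        return StreamStats(ttft_ms=0.0, decode_tps=0.0, total_tokens=0)
    ttft = token_times[0] - request_start
    # Assumption: the decode window starts at the first token, so TTFT is excluded.
    decode_window = token_times[-1] - token_times[0]
    decoded = len(token_times) - 1
    tps = decoded / decode_window if decode_window > 0 else 0.0
    return StreamStats(ttft_ms=ttft * 1000, decode_tps=tps, total_tokens=len(token_times))


# Hypothetical stream: request sent at t=10.0s, four tokens arrive afterwards.
stats = stream_stats(10.0, [10.25, 10.5, 10.75, 11.25])
```

Keeping TTFT out of the throughput figure lets a model with a slow start but fast decoding show up as exactly that, rather than as a mediocre average.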

Battle

Multi-round, human-judged. After each round you press 1–N to pick the best response. A running scoreboard shows wins per model.

  Round 10 of 10
  ──────────────────────────────────────────────────────────
   Score so far:
     model-A    ███████░░░░░░░░░░  4
     model-B    █████░░░░░░░░░░░░  3
     model-C    ████░░░░░░░░░░░░░  2
     model-D    ░░░░░░░░░░░░░░░░░  0
  ──────────────────────────────────────────────────────────

Blind mode

In a blind battle, model identities are replaced with A / B / C during each round, and names are revealed only at the end. This removes bias from logos, vendors, and sizes; you'll be surprised how often the "small" model wins.
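One way to picture blind labeling and the final reveal is the sketch below. It is illustrative only: `assign_blind_labels` and `tally` are hypothetical helpers, and it assumes a single label mapping is kept for the whole battle (the tool may reshuffle more often).

```python
import random
import string


def assign_blind_labels(models: list[str], rng: random.Random) -> dict[str, str]:
    """Map anonymous labels (A, B, C, ...) to model names in shuffled order."""
    shuffled = models[:]
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_uppercase, shuffled))


def tally(votes: list[str], labels: dict[str, str]) -> dict[str, int]:
    """Count wins per real model name from the labels the user voted for."""
    wins = {name: 0 for name in labels.values()}
    for label in votes:
        wins[labels[label]] += 1
    return wins


models = ["qwen3-9b", "llama-3.3-70b", "phi-4"]
labels = assign_blind_labels(models, random.Random(7))
votes = ["A", "C", "A", "B"]        # hypothetical votes, one per round
revealed = tally(votes, labels)     # real names appear only here, at the end
```

During the rounds only the keys of `labels` would be shown; the mapping back to model names is consulted once, when the scoreboard is revealed.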

Tournament

Fully automated. You pick a judge model and a suite, then every prompt in the suite is run against every selected model and judged.
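As a rough sketch of what "every prompt against every model" expands to, the snippet below enumerates head-to-head matches. The pairing strategy is an assumption (all unordered pairs of models per prompt), and `plan_matches` is a hypothetical name; the source does not describe how the tool schedules comparisons.

```python
from itertools import combinations
from typing import Iterator


def plan_matches(prompts: list[str], models: list[str]) -> Iterator[tuple[str, str, str]]:
    """Yield (prompt, model_a, model_b) for every prompt and every unordered model pair."""
    for prompt in prompts:
        for model_a, model_b in combinations(models, 2):
            yield prompt, model_a, model_b


# Hypothetical inputs: 2 prompts and 3 models give 2 * 3 = 6 judged matches.
matches = list(plan_matches(["p1", "p2"], ["m1", "m2", "m3"]))
```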

Built-in suites

  Suite         What's in it                                          Judging criteria
  ──────────────────────────────────────────────────────────────────────────────────────────────
  Reasoning     Logic puzzles, math word problems, deduction chains   Correctness
  Coding        Bug fixes, refactors, algorithm questions             Correctness + edge cases
  Creative      Open-ended writing, brainstorming                     Variety + coherence
  Instruction   Multi-constraint requests with hidden rules           Constraint compliance
  Analysis      Document summarization, fact extraction               Accuracy + completeness

You can also supply a list of custom prompts.

Tournament judging

The critical thing: coding prompts are judged on correctness, not 'creativity' or 'tone'. The judge model is given suite-specific criteria as part of its system prompt:

[Coding suite judge prompt — paraphrased]
You are evaluating two code responses. Judge ONLY on:
1. Does the code correctly solve the stated problem?
2. Are edge cases handled?
3. Are there subtle bugs?
Do NOT reward 'effort', 'verbosity', or 'creative approaches'.
Pick A, B, or TIE.

This is why tournament results look different from quick-compare gut feels — the judge isn't picking the response it likes, it's picking the one that's correct.
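The idea of suite-specific criteria can be sketched as a lookup that feeds the judge's system prompt, plus a strict parser for the verdict. Everything here is hypothetical: `SUITE_CRITERIA`, `build_judge_prompt`, and `parse_verdict` are illustrative names, only the Coding entry mirrors the paraphrased prompt above, and the parser assumes the judge ends its reply with the verdict on its own line.

```python
from typing import Optional

# Hypothetical criteria table; only "coding" is taken from the paraphrased prompt.
SUITE_CRITERIA = {
    "coding": [
        "Does the code correctly solve the stated problem?",
        "Are edge cases handled?",
        "Are there subtle bugs?",
    ],
}


def build_judge_prompt(suite: str) -> str:
    """Assemble a judge system prompt from the suite's criteria."""
    numbered = "\n".join(
        f"{i}. {criterion}" for i, criterion in enumerate(SUITE_CRITERIA[suite], 1)
    )
    return (
        "You are evaluating two responses. Judge ONLY on:\n"
        f"{numbered}\n"
        "Pick A, B, or TIE."
    )


def parse_verdict(judge_reply: str) -> Optional[str]:
    """Return 'A', 'B', or 'TIE' from the last non-empty line, else None."""
    lines = [line.strip() for line in judge_reply.splitlines() if line.strip()]
    if not lines:
        return None
    last = lines[-1].strip(" .*`'\"").upper()
    return last if last in {"A", "B", "TIE"} else None


prompt = build_judge_prompt("coding")
```

Returning `None` for an unparseable reply, instead of silently counting it as a tie, keeps judge failures visible in the transcript.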

Leaderboard

Final output:

  Tournament · Coding suite · judge=qwen3-32b · 5 models × 12 prompts
  ────────────────────────────────────────────────────────────────────────
   Rank  Model                  Wins  WR     Avg TPS   Avg TTFT
  ────────────────────────────────────────────────────────────────────────
   1     llama-3.3-70b-instruct  39   65.0%  31.4 t/s  410 ms
   2     qwen3.6-27b             32   53.3%  44.1 t/s  280 ms
   3     phi-4                   28   46.7%  56.0 t/s  140 ms
   4     mistral-large-2407      24   40.0%  22.8 t/s  580 ms
   5     gemma-2-27b             17   28.3%  28.3 t/s  220 ms
  ────────────────────────────────────────────────────────────────────────

WR = win rate. Avg TPS is decode-only throughput and Avg TTFT is time to first token (same definitions as Chat#performance-metrics).
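For reference, the ranking arithmetic can be sketched as follows. `Row` is a hypothetical structure, and the denominator of 60 comparisons per model is inferred from the table (39 wins at 65.0%); how the tool actually counts comparisons is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Row:
    model: str
    wins: int
    comparisons: int  # assumed denominator, inferred from the table above

    @property
    def win_rate(self) -> float:
        """Win rate as a percentage; 0.0 when nothing was judged."""
        return 100.0 * self.wins / self.comparisons if self.comparisons else 0.0


rows = [
    Row("phi-4", 28, 60),
    Row("llama-3.3-70b-instruct", 39, 60),
    Row("gemma-2-27b", 17, 60),
]
ranked = sorted(rows, key=lambda row: row.win_rate, reverse=True)

for rank, row in enumerate(ranked, 1):
    print(f"{rank}  {row.model:<24}{row.wins:>4}  {row.win_rate:.1f}%")
```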

Picking models from multiple servers

Discovery returns a flat list of (server, model) pairs, so arena can mix models across servers — useful for comparing your local Ollama against a remote vLLM box on the same prompt.
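A flat list of (server, model) pairs might look like the sketch below. The `Target` type, the `flatten` helper, and both server URLs are hypothetical; the real discovery code may carry more fields per entry.

```python
from typing import NamedTuple


class Target(NamedTuple):
    server: str
    model: str


def flatten(discovered: dict[str, list[str]]) -> list[Target]:
    """Turn {server: [models]} into a flat list of (server, model) pairs."""
    return [
        Target(server, model)
        for server, models in discovered.items()
        for model in models
    ]


# Hypothetical discovery result: one local server and one remote box.
discovered = {
    "http://localhost:11434": ["qwen3-9b"],
    "http://gpu-box:8000": ["llama-3.3-70b", "phi-4"],
}
targets = flatten(discovered)

# An arena selection is just a subset of the flat list, so servers can be mixed freely.
selection = [targets[0], targets[1]]
```

Because each entry carries its own server, the arena never needs a "current server" setting to route a request.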

Export

Every arena run dumps a full markdown transcript: every prompt, every response, every judge rationale, and the final leaderboard. File names look like arena_20260403_071611.md. This is the artifact you point to when someone asks "why did you pick model X".
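The example file name is consistent with a date-and-time stamp. A minimal sketch, assuming the pattern is `arena_<YYYYMMDD>_<HHMMSS>.md` in local time (`export_filename` is a hypothetical helper, not the tool's API):

```python
from datetime import datetime


def export_filename(now: datetime) -> str:
    """Build a transcript file name from a timestamp (assumed naming pattern)."""
    return now.strftime("arena_%Y%m%d_%H%M%S.md")


name = export_filename(datetime(2026, 4, 3, 7, 16, 11))
```

Names built this way sort chronologically in a directory listing, which makes it easy to find the run behind a given decision.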

Model Chat CLI · MIT · repo · issues · No telemetry · No cloud calls · No surprises
