
Import FEV-bench results + CD-diagram plot + Elo table #18

Open
huikan-tfc wants to merge 15 commits into main from feat/import-fev-results

@huikan-tfc huikan-tfc commented May 29, 2026

Summary

Three additions to make this benchmark useful for cross-comparison with the broader TSFM benchmarking ecosystem:

  • scripts/import_fev_bench.py — downloads upstream FEV-bench CSV results from autogluon/fev and rewrites them as a single benchopt-schema parquet (21 solvers × 100 forecasting tasks). Idempotent caching; GitHub Contents API discovery with a fallback list; CLI flags for --out / --cache-dir / --no-download / --force.

  • plots/plot_cd_grid.py — Demšar (2006) Critical Difference diagrams. Two front-ends share the same Friedman + Nemenyi computation:

    • CLI: bundles every numeric objective_* column into a single subplot grid PNG.
    • benchopt custom plot: a Plot(BasePlot) class registers the CD diagram as a new "Chart type" in the per-run HTML report, with the objective_column selector switching metrics in place.
  • plots/elo.py — Elo leaderboard as a benchopt table-type custom plot. Bradley-Terry MLE rating fit with bootstrap 95% CI; methodology follows TabArena (Erickson et al. 2025, arXiv:2506.16791) which itself follows Chatbot Arena (Chiang et al. 2024). 400-Elo gap = 91% expected win rate; mean Elo anchored at 1000.

  • scripts/README.md — usage for the importer + CD diagram script, schema mapping for the importer, notes on the greedy-biclique trim for sparse benchmarks.
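For reviewers, the idempotent caching and the flag interaction described for the importer can be sketched as below. This is an illustration under stated assumptions, not the code in scripts/import_fev_bench.py: `fetch_cached`, `build_parser`, the `download` callback, the flag defaults, and the choice to let --no-download win over --force are all hypothetical.

```python
import argparse
from pathlib import Path


def fetch_cached(name, cache_dir, download, *, no_download=False, force=False):
    """Return the cached file for `name`, calling `download(name)` only when needed.

    `download` is a hypothetical callback returning the file content as bytes.
    """
    path = Path(cache_dir) / name
    # Cache hit: a rerun does no network I/O unless --force asks for a refresh.
    # Assumption: --no-download takes precedence over --force.
    if path.exists() and (no_download or not force):
        return path
    if no_download:
        raise FileNotFoundError(f"{name} is not cached and --no-download is set")
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so an interrupted run leaves no partial CSV.
    tmp = path.with_suffix(path.suffix + ".part")
    tmp.write_bytes(download(name))
    tmp.replace(path)
    return path


def build_parser():
    """Argument parser exposing the four flags named in the summary."""
    parser = argparse.ArgumentParser(description="Sketch of the importer CLI")
    parser.add_argument("--out", default="outputs/fev_bench_results.parquet")  # assumed default
    parser.add_argument("--cache-dir", default=".fev_cache")  # assumed default
    parser.add_argument("--no-download", action="store_true")
    parser.add_argument("--force", action="store_true")
    return parser
```

Writing to a `.part` file and renaming keeps the cache directory consistent if a download is interrupted, which is one way to make repeated runs safe.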

Once imported, FEV results render alongside native runs:

python scripts/import_fev_bench.py
benchopt plot . --html --no-display -f outputs/fev_bench_results.parquet

The HTML "Chart type" dropdown then offers:

  • Objective curve (built-in)
  • Bar chart / Boxplot / Table (built-in)
  • Critical difference diagram (Demšar 2006 — equivalence cliques via Friedman + Nemenyi)
  • Elo (TabArena 2025 — Bradley-Terry rating with bootstrap 95% CI as a table)
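A minimal, dependency-free sketch of how pairwise wins turn into Bradley-Terry ratings on the Elo scale follows. It is not the code in plots/elo.py: the PR fits the ratings with L-BFGS-B, whereas this sketch swaps in a plain minorization-maximization (MM) fixed-point iteration, and the function names and input layout are hypothetical. It assumes every solver has at least one win and one loss, otherwise the maximum-likelihood estimate is not finite.

```python
import math
from itertools import combinations


def pairwise_wins(scores):
    """scores: {dataset: {solver: error}}, lower is better.

    Returns wins[a][b] = number of datasets where a beat b (a tie counts 0.5 each).
    """
    solvers = sorted({s for row in scores.values() for s in row})
    wins = {a: {b: 0.0 for b in solvers if b != a} for a in solvers}
    for row in scores.values():
        for a, b in combinations(sorted(row), 2):
            if row[a] < row[b]:
                wins[a][b] += 1.0
            elif row[b] < row[a]:
                wins[b][a] += 1.0
            else:
                wins[a][b] += 0.5
                wins[b][a] += 0.5
    return wins


def bradley_terry_elo(wins, iters=500, mean_elo=1000.0):
    """Fit Bradley-Terry strengths by MM iteration and map them to Elo."""
    solvers = list(wins)
    p = {s: 1.0 for s in solvers}
    for _ in range(iters):
        new = {}
        for i in solvers:
            total_wins = sum(wins[i].values())
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j]) for j in wins[i])
            new[i] = total_wins / denom
        # Fix the scale: geometric mean of strengths = 1, so the mean Elo is mean_elo.
        geo = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        p = {s: v / geo for s, v in new.items()}
    return {s: mean_elo + 400.0 * math.log10(p[s]) for s in solvers}


def expected_win(elo_gap):
    """Win probability of the higher-rated solver for a given Elo gap."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))
```

With this scaling, `expected_win(400)` equals 10/11, which is the 91% figure quoted in the summary.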

Headline result on FEV — objective_sql

| Method | Top-1 by mean rank (CD diagram) | Top-1 by Elo |
|---|---|---|
| Chronos-2 | rank 3.0, CD = 3.28 | Elo 1507, 95% CI [1379, 1529] |
| TimesFM-2.5 | rank 4.5 (tied with Chronos-2) | Elo 1414, CI [1292, 1450] |
| TiRex | rank 4.6 (tied) | Elo 1409, CI [1293, 1437] |

Both plots arrive at the same scientific conclusion — the top-3 foundation models form a statistically indistinguishable cluster — but expose it through different visual encodings.
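The CD value in the table can be sanity-checked from Demšar's formula, CD = q_α · sqrt(k(k+1) / (6N)), where q_α is the studentized-range quantile (infinite degrees of freedom) divided by √2. The sketch below is an independent check, not the code in plots/plot_cd_grid.py (which wraps scikit_posthocs): it computes q_α by numerically integrating the distribution of the range of k standard normals, it assumes α = 0.05, and all function names are hypothetical.

```python
import math


def _norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def range_cdf(r, k, steps=2000, lim=8.0):
    """P(range of k iid N(0,1) draws <= r), by the trapezoid rule.

    Uses P(R <= r) = k * integral of pdf(x) * (cdf(x) - cdf(x - r))**(k - 1) dx.
    """
    h = 2.0 * lim / steps
    total = 0.0
    for i in range(steps + 1):
        x = -lim + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * _norm_pdf(x) * (_norm_cdf(x) - _norm_cdf(x - r)) ** (k - 1)
    return k * h * total


def nemenyi_q(k, alpha=0.05):
    """Nemenyi critical value: studentized-range quantile divided by sqrt(2)."""
    lo, hi = 0.0, 20.0
    for _ in range(60):  # bisection on the monotone CDF
        mid = 0.5 * (lo + hi)
        if range_cdf(mid, k) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / math.sqrt(2.0)


def critical_difference(k, n_blocks, alpha=0.05):
    """Minimum mean-rank gap that the Nemenyi test treats as significant."""
    return nemenyi_q(k, alpha) * math.sqrt(k * (k + 1) / (6.0 * n_blocks))
```

Calling `critical_difference(21, 91)` should land close to the CD = 3.28 reported for objective_sql, provided the diagram uses α = 0.05 and 91 complete datasets. Two solvers whose mean ranks differ by less than this value end up in the same clique.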

Test plan

  • python scripts/import_fev_bench.py --no-download on cached CSVs produces 2,085 rows covering 21 solvers and 100 tasks (15 short of the full 21 × 100 = 2,100 grid)
  • python scripts/import_fev_bench.py (live fetch) discovers models via GitHub API and downloads in parallel
  • benchopt plot . --html --no-display succeeds with both custom Plot classes in place
  • "Critical difference diagram" appears in the Chart type dropdown alongside the 4 built-in plots
  • "Elo" appears in the Chart type dropdown
  • objective_column selector switches metrics in place for both custom plots; tables/diagrams re-render correctly
  • CLI grid PNG still works: python plots/plot_cd_grid.py outputs/fev_bench_results.parquet
  • Elo bootstrap (N=200) completes in <10s for FEV (k=21, N=91)

🤖 Generated with Claude Code

huikan-tfc and others added 5 commits May 29, 2026 10:44
Downloads upstream FEV-bench CSV results from autogluon/fev and converts
them to a benchopt-schema parquet so they can be visualized alongside
native runs in the benchopt HTML report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
plot_cd_grid.py: Demšar (2006) Critical Difference diagrams — one subplot
per objective_* metric, with a greedy biclique fallback for sparse
benchmarks. Wraps scikit_posthocs.critical_difference_diagram.

README.md: usage + flags for import_fev_bench.py and plot_cd_grid.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the CD-diagram script's location with benchopt's `plots/` convention.
README updated to reflect the new path; the file remains a standalone CLI
(not a benchopt BasePlot subclass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a Plot(BasePlot) class to plots/plot_cd_grid.py so the CD diagram
appears in the benchopt HTML "Chart type" dropdown alongside Objective
curve / Bar chart / Boxplot / Table. The objective_column selector
switches metrics in place. CLI usage is unchanged.

Resolves the conflict that `plots/` is reserved by benchopt for BasePlot
subclasses — every .py there must define a Plot class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
plots/elo.py — Bradley-Terry MLE rating + 200-round bootstrap 95% CI,
rendered as a benchopt table-type custom plot. Appears in the HTML
"Chart type" dropdown alongside the CD diagram.

Methodology follows TabArena (Erickson et al. 2025, arXiv:2506.16791),
which itself follows Chatbot Arena (Chiang et al. 2024):
- pairwise wins/losses on the chosen objective_column (lower-is-better)
- BT log-ratings fit via L-BFGS-B with analytic gradient and stable log_expit
- 95% CI from bootstrap over datasets, mean Elo anchored at 1000
- 400-Elo gap convention (= 91% expected win rate)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@huikan-tfc huikan-tfc changed the title Import FEV-bench results + CD-diagram plot Import FEV-bench results + CD diagram + Elo custom plots May 29, 2026
@huikan-tfc huikan-tfc changed the title Import FEV-bench results + CD diagram + Elo custom plots Import FEV-bench results + CD-diagram plot + Elo table May 29, 2026
huikan-tfc and others added 10 commits May 29, 2026 11:50
Design change request:
- 7 columns → 3 (Solver, Elo, Games)
- Elo cell: point estimate on top, signed CI deltas underneath
  (yellow −low, green +high) — uses HTML in the cell because
  benchopt's table renderer calls td.innerHTML
- Click any column header to sort; numeric columns parse the
  leading Elo value out of the multi-line cell
- Wired via <img onerror> side-channel since benchopt's table
  builds <th> with innerText (no onclick hook) and innerHTML
  doesn't execute <script> tags

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per design request — replaces the table view with a benchopt bar_chart:

- Bar height = Elo point estimate (median of [ci_low, elo, ci_high])
- Three horizontal-line markers per bar at ci_low / elo / ci_high render
  as the candle low/high ticks
- Per-solver colour via BasePlot.get_style()
- Sorted left-to-right by Elo descending

Required dropping the text label on each bar: benchopt's JS draws the
scatter markers only when text === '' (result.js:185-204), so a non-empty
text suppresses the CI ticks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For forecasting benchmarks the canonical reference baseline is the
seasonal naive forecaster. Anchoring it at Elo=1000 makes the
leaderboard interpretable across metrics and re-runs: better-than-
seasonal-naive solvers sit above 1000, worse-than-seasonal-naive
solvers below.

Falls back to plain "Naive" if no seasonal variant is present, and
finally to the mean-anchor convention if no naive baseline exists
at all. Same convention TabArena (arXiv:2506.16791) uses with
default RandomForest = 1000.
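
The fallback order above can be sketched as follows. This is an illustration only: `pick_anchor` and `anchor_elo` are hypothetical names, and matching baselines by the substrings "seasonal" and "naive" is an assumption about how solvers are labelled.

```python
def pick_anchor(solvers):
    """Return the baseline to pin at the target Elo, or None for mean anchoring."""
    lowered = {s: s.lower() for s in solvers}
    # First choice: a seasonal naive variant.
    for solver, name in lowered.items():
        if "seasonal" in name and "naive" in name:
            return solver
    # Second choice: plain naive.
    for solver, name in lowered.items():
        if "naive" in name:
            return solver
    return None


def anchor_elo(elo, target=1000.0):
    """Shift all ratings by one constant so the anchor (or the mean) sits at target."""
    anchor = pick_anchor(elo)
    if anchor is None:
        shift = target - sum(elo.values()) / len(elo)
    else:
        shift = target - elo[anchor]
    return {solver: rating + shift for solver, rating in elo.items()}
```

Because the shift is a single additive constant, pairwise Elo gaps and therefore expected win rates are unchanged by the choice of anchor.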

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a synthetic objective_column value "global" to the Critical Difference
Diagram dropdown. When selected:

- For every numeric objective_* metric M, build per-dataset ranks
  (respecting HIGHER_IS_BETTER direction)
- Intersect the solver set across metrics
- Concatenate (k × N_M) rank matrices along axis 1: each (metric, dataset)
  pair becomes one Friedman block
- Friedman + Nemenyi + Demšar CD on the stacked rank matrix

Injection is via overriding BasePlot._get_all_plots so the "global" entry
appears in both the dropdown options list and the plots-by-key cache that
the HTML frontend reads.

On FEV (k=21, 7 metrics, 91 datasets) this yields 7 × 91 = 637 blocks and CD≈1.24,
about 2.6x (√7) tighter than a single-metric CD (e.g. 3.28 for objective_sql) because of the larger block count.
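
A dependency-free sketch of the stacking step, for illustration only: the names and the nested-dict input layout are hypothetical, blocks are stored as rows here rather than as columns of a k × N matrix, and the solver intersection is taken over every (metric, dataset) row, which is stricter than intersecting per metric.

```python
def rank_row(values, higher_is_better=False):
    """Ranks for one block (1 = best); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = average
        i = j + 1
    return ranks


def stacked_blocks(results, higher_is_better=()):
    """results: {metric: {dataset: {solver: value}}}.

    Returns (solvers, blocks) with one rank row per (metric, dataset) pair.
    """
    rows = [row for per_dataset in results.values() for row in per_dataset.values()]
    solvers = sorted(set.intersection(*(set(row) for row in rows)))
    blocks = []
    for metric, per_dataset in results.items():
        flip = metric in higher_is_better
        for row in per_dataset.values():
            blocks.append(rank_row([row[s] for s in solvers], flip))
    return solvers, blocks


def mean_ranks(solvers, blocks):
    """Average rank per solver over all stacked blocks."""
    return {s: sum(b[i] for b in blocks) / len(blocks) for i, s in enumerate(solvers)}
```

The mean ranks and the block count `len(blocks)` are the two inputs the Friedman + Nemenyi step needs, which is why adding metrics shrinks the CD.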

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

3 participants