Import FEV-bench results + CD-diagram plot + Elo table by huikan-tfc · Pull Request #18 · benchopt/benchmark_tsfm

huikan-tfc · 2026-05-29T09:17:02Z

Summary

Three additions to make this benchmark useful for cross-comparison with the broader TSFM benchmarking ecosystem:

- scripts/import_fev_bench.py: downloads upstream FEV-bench CSV results from autogluon/fev and rewrites them as a single benchopt-schema parquet (21 solvers × 100 forecasting tasks). Idempotent caching; model discovery via the GitHub Contents API with a fallback list; CLI flags --out / --cache-dir / --no-download / --force.
- plots/plot_cd_grid.py: Demšar (2006) Critical Difference diagrams. Two front-ends share the same Friedman + Nemenyi computation:
  - CLI: bundles every numeric objective_* column into a single subplot grid PNG.
  - benchopt custom plot: a Plot(BasePlot) class registers the CD diagram as a new "Chart type" in the per-run HTML report, with the objective_column selector switching metrics in place.
- plots/elo.py: Elo leaderboard as a benchopt table-type custom plot. Bradley-Terry MLE rating fit with bootstrap 95% CI; methodology follows TabArena (Erickson et al. 2025, arXiv:2506.16791), which itself follows Chatbot Arena (Chiang et al. 2024). 400-Elo gap = 91% expected win rate; mean Elo anchored at 1000.
- scripts/README.md: usage for the importer + CD diagram script, schema mapping for the importer, notes on the greedy-biclique trim for sparse benchmarks.
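The "400-Elo gap = 91% expected win rate" convention follows from the standard logistic Elo formula. A minimal sketch (the function name is illustrative and not taken from plots/elo.py):

```python
def expected_win_rate(elo_gap: float, scale: float = 400.0) -> float:
    """Expected win probability of the higher-rated solver for a given Elo gap."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


# A 400-point gap gives 10/11, i.e. roughly 91%.
print(round(expected_win_rate(400), 3))  # 0.909
# Equal ratings give a coin flip.
print(expected_win_rate(0))  # 0.5
```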

Once imported, FEV results render alongside native runs:

python scripts/import_fev_bench.py
benchopt plot . --html --no-display -f outputs/fev_bench_results.parquet

The HTML "Chart type" dropdown then offers:

- Objective curve (built-in)
- Bar chart / Boxplot / Table (built-in)
- Critical difference diagram (Demšar 2006: equivalence cliques via Friedman + Nemenyi)
- Elo (TabArena 2025: Bradley-Terry rating with bootstrap 95% CI as a table)
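For readers unfamiliar with the statistics behind the CD diagram, the sketch below shows how per-dataset ranks and the Friedman chi-square statistic are usually computed. It is a plain-Python illustration with made-up toy scores, not the code in plots/plot_cd_grid.py, and it omits the tie correction that some implementations apply.

```python
def average_ranks(scores, higher_is_better=False):
    """Rank one dataset's scores (1 = best), averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of solvers tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_statistic(score_rows, higher_is_better=False):
    """score_rows[d][s] is the score of solver s on dataset d.

    Returns (mean_ranks, chi2) using the uncorrected Friedman formula.
    """
    n, k = len(score_rows), len(score_rows[0])
    rank_rows = [average_ranks(row, higher_is_better) for row in score_rows]
    mean_ranks = [sum(row[s] for row in rank_rows) / n for s in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(m * m for m in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2


# Toy errors (lower is better) for 3 solvers on 4 datasets; values are invented.
toy = [[0.1, 0.2, 0.3], [0.1, 0.3, 0.2], [0.2, 0.1, 0.3], [0.1, 0.2, 0.3]]
mean_ranks, chi2 = friedman_statistic(toy)
print(mean_ranks, chi2)  # [1.25, 2.0, 2.75] 4.5
```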

Headline result on FEV — `objective_sql`

| Method | Mean rank (CD diagram) | Elo (bootstrap 95% CI) |
| --- | --- | --- |
| Chronos-2 | 3.0 (CD = 3.28) | 1507 [1379, 1529] |
| TimesFM-2.5 | 4.5 (tied with Chronos-2) | 1414 [1292, 1450] |
| TiRex | 4.6 (tied) | 1409 [1293, 1437] |

Both plots lead to the same conclusion: the top three foundation models are statistically indistinguishable. Their mean-rank gaps (at most 1.6) are well below CD = 3.28, and their Elo confidence intervals overlap. The two plots differ only in how they encode this visually.
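The numbers in the table can be checked directly. The sketch below uses only the reported values; note that treating overlapping confidence intervals as "indistinguishable" is a common heuristic rather than a formal test, and the helper names are illustrative.

```python
from itertools import combinations

CD = 3.28  # critical difference reported for objective_sql
mean_rank = {"Chronos-2": 3.0, "TimesFM-2.5": 4.5, "TiRex": 4.6}
elo_ci = {
    "Chronos-2": (1379, 1529),
    "TimesFM-2.5": (1292, 1450),
    "TiRex": (1293, 1437),
}


def rank_tied(a, b):
    """Nemenyi criterion: two solvers are tied if their mean ranks differ by less than CD."""
    return abs(mean_rank[a] - mean_rank[b]) < CD


def ci_overlap(a, b):
    """True if the two bootstrap intervals share at least one point."""
    (lo_a, hi_a), (lo_b, hi_b) = elo_ci[a], elo_ci[b]
    return lo_a <= hi_b and lo_b <= hi_a


for a, b in combinations(mean_rank, 2):
    print(f"{a} vs {b}: rank-tied={rank_tied(a, b)}, CI-overlap={ci_overlap(a, b)}")
```

Every pair comes out tied under both criteria.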

Test plan

python scripts/import_fev_bench.py --no-download on cached CSVs produces 2,085 rows covering 21 solvers and 100 tasks
python scripts/import_fev_bench.py (live fetch) discovers models via GitHub API and downloads in parallel
benchopt plot . --html --no-display succeeds with both custom Plot classes in place
"Critical difference diagram" appears in the Chart type dropdown alongside the 4 built-in plots
"Elo" appears in the Chart type dropdown
objective_column selector switches metrics in place for both custom plots; tables/diagrams re-render correctly
CLI grid PNG still works: python plots/plot_cd_grid.py outputs/fev_bench_results.parquet
Elo bootstrap (200 rounds) completes in <10s for FEV (k=21 solvers, N=91 datasets)
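The bootstrap in the last item resamples datasets with replacement and recomputes a statistic each round. The sketch below shows the percentile-interval idea on a simple statistic (a win rate); the PR refits the full Bradley-Terry model per round instead. The win/loss data and all names here are hypothetical.

```python
import random
from statistics import fmean


def bootstrap_ci(per_dataset_values, stat, n_rounds=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI of `stat`, resampling datasets with replacement."""
    rng = random.Random(seed)
    n = len(per_dataset_values)
    stats = sorted(
        stat([per_dataset_values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_rounds)
    )
    lo = stats[int((alpha / 2) * (n_rounds - 1))]
    hi = stats[int((1 - alpha / 2) * (n_rounds - 1))]
    return lo, hi


# Hypothetical outcome of one solver against another on 91 datasets (1 = win).
wins = [1] * 60 + [0] * 31
lo, hi = bootstrap_ci(wins, fmean)
print(f"win rate {fmean(wins):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```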

🤖 Generated with Claude Code

Downloads upstream FEV-bench CSV results from autogluon/fev and converts them to a benchopt-schema parquet so they can be visualized alongside native runs in the benchopt HTML report. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
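A minimal sketch of how the four CLI flags and the idempotent cache could interact. The flag names come from the PR description; the default cache directory, the precedence of --no-download over --force, and the helper names are assumptions for illustration, not the actual contents of scripts/import_fev_bench.py.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Sketch of an FEV-bench importer CLI")
    p.add_argument("--out", type=Path,
                   default=Path("outputs/fev_bench_results.parquet"))
    # The default cache location below is an assumption.
    p.add_argument("--cache-dir", type=Path, default=Path(".fev_cache"))
    p.add_argument("--no-download", action="store_true",
                   help="use cached CSVs only")
    p.add_argument("--force", action="store_true",
                   help="re-download even if a cached copy exists")
    return p


def needs_download(cache_file: Path, no_download: bool, force: bool) -> bool:
    """Decide whether a CSV must be fetched; assumes --no-download wins over --force."""
    if no_download:
        if not cache_file.exists():
            raise FileNotFoundError(f"{cache_file} is not cached and downloads are disabled")
        return False
    return force or not cache_file.exists()


args = build_parser().parse_args(["--no-download"])
print(args.no_download, args.force, args.out)
```

Running the importer twice without --force then leaves the cache untouched, which is what makes the step idempotent.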

plot_cd_grid.py: Demšar (2006) Critical Difference diagrams, one subplot per objective_* metric, with a greedy biclique fallback for sparse benchmarks. Wraps scikit_posthocs.critical_difference_diagram.

README.md: usage + flags for import_fev_bench.py and plot_cd_grid.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
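The commit does not spell out the greedy biclique heuristic. One plausible reading, sketched below purely as an assumption, is to keep dropping whichever solver or dataset has the largest share of missing results until the remaining solver × dataset grid is complete, since the Friedman test needs complete blocks.

```python
def greedy_complete_block(available):
    """available maps solver -> set of datasets with a result.

    Greedily removes the solver or dataset with the largest fraction of gaps
    until every remaining solver has a result on every remaining dataset.
    This is a hypothetical heuristic, not the PR's exact rule.
    """
    solvers = set(available)
    datasets = set().union(*available.values())
    while solvers and datasets:
        gaps_s = {s: len(datasets - available[s]) for s in solvers}
        gaps_d = {d: sum(d not in available[s] for s in solvers) for d in datasets}
        worst_s = max(sorted(solvers), key=gaps_s.get)
        worst_d = max(sorted(datasets), key=gaps_d.get)
        if gaps_s[worst_s] == 0:
            break  # the grid is already complete
        if gaps_s[worst_s] / len(datasets) > gaps_d[worst_d] / len(solvers):
            solvers.remove(worst_s)
        else:
            datasets.remove(worst_d)
    return sorted(solvers), sorted(datasets)


full = {"d1", "d2", "d3", "d4"}
# Solver C lacks one dataset: dropping d4 keeps all three solvers.
print(greedy_complete_block({"A": full, "B": full, "C": {"d1", "d2", "d3"}}))
# Solver D covers a single dataset: dropping D keeps all four datasets.
print(greedy_complete_block({"A": full, "B": full, "C": full, "D": {"d1"}}))
```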

Aligns the CD-diagram script's location with benchopt's `plots/` convention. README updated to reflect the new path; the file remains a standalone CLI (not a benchopt BasePlot subclass). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a Plot(BasePlot) class to plots/plot_cd_grid.py so the CD diagram appears in the benchopt HTML "Chart type" dropdown alongside Objective curve / Bar chart / Boxplot / Table. The objective_column selector switches metrics in place. CLI usage is unchanged. Resolves the conflict that `plots/` is reserved by benchopt for BasePlot subclasses — every .py there must define a Plot class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

plots/elo.py: Bradley-Terry MLE rating + 200-round bootstrap 95% CI, rendered as a benchopt table-type custom plot. Appears in the HTML "Chart type" dropdown alongside the CD diagram. Methodology follows TabArena (Erickson et al. 2025, arXiv:2506.16791), which itself follows Chatbot Arena (Chiang et al. 2024):
- pairwise wins/losses on the chosen objective_column (lower-is-better)
- BT log-ratings fit via L-BFGS-B with analytic gradient and stable log_expit
- 95% CI from bootstrap over datasets, mean Elo anchored at 1000
- 400-Elo gap convention (= 91% expected win rate)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
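To make the rating step concrete, here is a dependency-free sketch of a Bradley-Terry fit mapped onto the Elo scale. It deliberately swaps the commit's L-BFGS-B optimiser for the classical MM (minorisation-maximisation) iteration, which reaches the same maximum-likelihood solution on connected comparison graphs. The win counts are invented, and the sketch assumes every solver wins at least once.

```python
import math


def bradley_terry_elo(wins, n_iter=500, tol=1e-10):
    """wins[i][j] = number of datasets on which solver i beat solver j.

    Returns Elo ratings with mean 1000 and the 400-point logistic scale.
    """
    k = len(wins)
    p = [1.0] * k  # Bradley-Terry strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(k):
            total_wins = sum(wins[i][j] for j in range(k) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(k) if j != i)
            new_p.append(total_wins / denom)
        # Fix the free scale: geometric mean of strengths equal to 1.
        log_mean = sum(math.log(x) for x in new_p) / k
        new_p = [x / math.exp(log_mean) for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    # P(i beats j) = p_i / (p_i + p_j) matches Elo when R = 400 * log10(p) + const.
    return [1000.0 + 400.0 * math.log10(x) for x in p]


# Invented pairwise win counts for three solvers over 10 datasets per pair.
toy_wins = [[0, 7, 9], [3, 0, 6], [1, 4, 0]]
elo = bradley_terry_elo(toy_wins)
print([round(r) for r in elo])
```

Fixing the geometric mean of the strengths at 1 is what anchors the mean Elo at 1000.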

Design change request:
- 7 columns → 3 (Solver, Elo, Games)
- Elo cell: point estimate on top, signed CI deltas underneath (yellow −low, green +high); uses HTML in the cell because benchopt's table renderer calls td.innerHTML
- Click any column header to sort; numeric columns parse the leading Elo value out of the multi-line cell
- Wired via <img onerror> side-channel since benchopt's table builds <th> with innerText (no onclick hook) and innerHTML doesn't execute <script> tags

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per design request, replaces the table view with a benchopt bar_chart:
- Bar height = Elo point estimate (median of [ci_low, elo, ci_high])
- Three horizontal-line markers per bar at ci_low / elo / ci_high render as the candle low/high ticks
- Per-solver colour via BasePlot.get_style()
- Sorted left-to-right by Elo descending

Required dropping the text label on each bar: benchopt's JS only draws the scatter markers when text === '' (result.js:185-204), so a non-empty text suppresses the CI ticks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For forecasting benchmarks the canonical reference baseline is the seasonal naive forecaster. Anchoring it at Elo=1000 makes the leaderboard interpretable across metrics and re-runs: better-than-seasonal-naive solvers sit above 1000, worse-than-seasonal-naive solvers below. Falls back to plain "Naive" if no seasonal variant is present, and finally to the mean-anchor convention if no naive baseline exists at all. This is the same convention TabArena (arXiv:2506.16791) uses with default RandomForest = 1000. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
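The fallback chain described in this commit (seasonal naive, then naive, then mean) can be sketched as a single shift of all ratings. The name matching below, which is case-insensitive and treats "_" and "-" as spaces, is an assumption about how solver names look; the ratings are invented.

```python
def anchor_elo(elo, target=1000.0):
    """Shift all ratings so the reference baseline sits at `target`.

    elo maps solver name -> rating. The baseline is "Seasonal Naive" if present,
    else "Naive", else the mean rating (hypothetical matching rule).
    """
    normalised = {
        name.lower().replace("_", " ").replace("-", " "): name for name in elo
    }
    for candidate in ("seasonal naive", "naive"):
        if candidate in normalised:
            offset = target - elo[normalised[candidate]]
            break
    else:
        # No naive baseline at all: fall back to the mean-anchor convention.
        offset = target - sum(elo.values()) / len(elo)
    return {name: rating + offset for name, rating in elo.items()}


print(anchor_elo({"Seasonal Naive": 800.0, "ModelA": 1300.0}))
# {'Seasonal Naive': 1000.0, 'ModelA': 1500.0}
```

Because only a constant is added, rating gaps and therefore expected win rates are unchanged by the choice of anchor.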

Adds a synthetic objective_column value "global" to the Critical Difference Diagram dropdown. When selected:
- For every numeric objective_* metric M, build per-dataset ranks (respecting the HIGHER_IS_BETTER direction)
- Intersect the solver set across metrics
- Concatenate the (k × N_M) rank matrices along axis 1, so that each (metric, dataset) pair becomes one Friedman block
- Run Friedman + Nemenyi + Demšar CD on the stacked rank matrix

Injection is via overriding BasePlot._get_all_plots so the "global" entry appears in both the dropdown options list and the plots-by-key cache that the HTML frontend reads. On FEV (k=21, 7 metrics, 91 datasets) this yields 637 blocks and CD≈1.24, about 2.6x (a factor of √7) tighter than a single-metric CD, because CD scales as 1/√N and the block count grows sevenfold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
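The effect of stacking on the critical difference can be checked from the reported figures with the Nemenyi formula CD = q_α · sqrt(k(k+1) / (6N)). The sketch below backs out q_α from the single-metric CD of 3.28 reported for objective_sql rather than looking it up, and it takes the formula's implicit assumption that blocks are independent at face value.

```python
import math


def critical_difference(q_alpha, k, n_blocks):
    """Nemenyi critical difference for k solvers ranked over n_blocks blocks."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_blocks))


k, n_datasets, n_metrics = 21, 91, 7
cd_single = 3.28  # reported single-metric CD

# Back out the critical value implied by the single-metric CD.
q_alpha = cd_single / math.sqrt(k * (k + 1) / (6.0 * n_datasets))

n_blocks = n_metrics * n_datasets  # each (metric, dataset) pair is one block
cd_global = critical_difference(q_alpha, k, n_blocks)
print(n_blocks, round(cd_global, 2), round(cd_single / cd_global, 2))  # 637 1.24 2.65
```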

huikan-tfc and others added 5 commits May 29, 2026 10:44

ENH add FEV-bench result importer

9aff0b8


huikan-tfc changed the title ~~Import FEV-bench results + CD-diagram plot~~ Import FEV-bench results + CD diagram + Elo custom plots May 29, 2026

huikan-tfc changed the title ~~Import FEV-bench results + CD diagram + Elo custom plots~~ Import FEV-bench results + CD-diagram plot + Elo table May 29, 2026

huikan-tfc and others added 10 commits May 29, 2026 11:50

Added tables

29477ae

Modified title and added CD visualization

0a3dcf8

Added support for lower is better metrics

7d68a9a

migrate the higher_is_better flag in the metrics and the objective

d6b704b

Final commit sprint

1aab0a6

Merge branch 'main' into feat/import-fev-results

5d19e29

huikan-tfc wants to merge 15 commits into main from feat/import-fev-results


3 participants
