Import FEV-bench results + CD-diagram plot + Elo table #18
Open
huikan-tfc wants to merge 15 commits into
Downloads upstream FEV-bench CSV results from autogluon/fev and converts them to a benchopt-schema parquet so they can be visualized alongside native runs in the benchopt HTML report. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
plot_cd_grid.py: Demšar (2006) Critical Difference diagrams — one subplot per objective_* metric, with a greedy biclique fallback for sparse benchmarks. Wraps scikit_posthocs.critical_difference_diagram. README.md: usage + flags for import_fev_bench.py and plot_cd_grid.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the CD-diagram script's location with benchopt's `plots/` convention. README updated to reflect the new path; the file remains a standalone CLI (not a benchopt BasePlot subclass). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a Plot(BasePlot) class to plots/plot_cd_grid.py so the CD diagram appears in the benchopt HTML "Chart type" dropdown alongside Objective curve / Bar chart / Boxplot / Table. The objective_column selector switches metrics in place. CLI usage is unchanged. Resolves the conflict that `plots/` is reserved by benchopt for BasePlot subclasses — every .py there must define a Plot class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
plots/elo.py: Bradley-Terry MLE rating + 200-round bootstrap 95% CI, rendered as a benchopt table-type custom plot. Appears in the HTML "Chart type" dropdown alongside the CD diagram. Methodology follows TabArena (Erickson et al. 2025, arXiv:2506.16791), which itself follows Chatbot Arena (Chiang et al. 2024):
- pairwise wins/losses on the chosen objective_column (lower-is-better)
- BT log-ratings fit via L-BFGS-B with analytic gradient and stable log_expit
- 95% CI from bootstrap over datasets, mean Elo anchored at 1000
- 400-Elo gap convention (= 91% expected win rate)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
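To make the rating step concrete, here is a minimal pure-Python sketch of a Bradley-Terry fit and its conversion to the Elo scale. It is not the code in plots/elo.py: it swaps L-BFGS-B for the simpler MM (Zermelo) fixed-point iteration, which reaches the same maximum-likelihood solution when every solver has at least one win and one loss. The function names and the toy `wins` matrix are hypothetical.

```python
import math


def fit_bradley_terry(wins, n_iter=1000, tol=1e-12):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is the number of times solver i beat solver j.
    Uses the MM (Zermelo) iteration, not L-BFGS-B (assumption of this sketch).
    """
    k = len(wins)
    p = [1.0] * k
    for _ in range(n_iter):
        new_p = []
        for i in range(k):
            w_i = sum(wins[i][j] for j in range(k) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(k)
                if j != i
            )
            # Clamp away from zero so the log below stays finite.
            new_p.append(max(w_i / denom, 1e-12) if denom > 0 else p[i])
        # Strengths are scale-free: normalise the geometric mean to 1.
        scale = math.exp(sum(math.log(x) for x in new_p) / k)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


def to_elo(strengths, anchor=1000.0):
    """Map strengths to Elo (400 * log10) and shift so the mean equals anchor."""
    raw = [400.0 * math.log10(s) for s in strengths]
    shift = anchor - sum(raw) / len(raw)
    return [r + shift for r in raw]


def expected_win_rate(elo_gap):
    """Expected win probability of the higher-rated solver for a given gap."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Toy example: solver 0 beat solver 1 on 3 of 4 datasets.
wins = [[0, 3], [1, 0]]
elo = to_elo(fit_bradley_terry(wins))
print(elo, expected_win_rate(400))
```

The 400-Elo convention follows directly from `expected_win_rate`: a gap of 400 gives 10/11, which is the 91% quoted in the commit message.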
Design change request:
- 7 columns → 3 (Solver, Elo, Games)
- Elo cell: point estimate on top, signed CI deltas underneath (yellow −low, green +high). Uses HTML in the cell because benchopt's table renderer calls td.innerHTML
- Click any column header to sort; numeric columns parse the leading Elo value out of the multi-line cell
- Wired via an <img onerror> side-channel, since benchopt's table builds <th> with innerText (no onclick hook) and innerHTML doesn't execute <script> tags
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per design request: replaces the table view with a benchopt bar_chart:
- Bar height = Elo point estimate (median of [ci_low, elo, ci_high])
- Three horizontal-line markers per bar at ci_low / elo / ci_high render as the candle low/high ticks
- Per-solver colour via BasePlot.get_style()
- Sorted left-to-right by Elo descending
This required dropping the text label on each bar: benchopt's JS makes the scatter markers conditional on text === '' (result.js:185-204), so a non-empty text suppresses the CI ticks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For forecasting benchmarks the canonical reference baseline is the seasonal naive forecaster. Anchoring it at Elo=1000 makes the leaderboard interpretable across metrics and re-runs: better-than-seasonal-naive solvers sit above 1000, worse-than-seasonal-naive solvers below. Falls back to plain "Naive" if no seasonal variant is present, and finally to the mean-anchor convention if no naive baseline exists at all. This is the same convention TabArena (arXiv:2506.16791) uses, with default RandomForest = 1000. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
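The fallback chain described in this commit can be sketched as follows. This is an illustration only: the name-matching rule, the function names `pick_anchor` and `anchor_elo`, and the solver names in the example are all assumptions, not taken from plots/elo.py.

```python
def pick_anchor(solver_names):
    """Return the reference solver, or None to fall back to mean anchoring.

    Matching on lowercase substrings is a hypothetical rule for this sketch.
    """
    normalised = {
        name: name.lower().replace("_", "").replace("-", "").replace(" ", "")
        for name in solver_names
    }
    # 1. Prefer a seasonal naive variant.
    for name, norm in normalised.items():
        if "seasonal" in norm and "naive" in norm:
            return name
    # 2. Otherwise fall back to plain "Naive".
    for name, norm in normalised.items():
        if norm == "naive":
            return name
    # 3. No naive baseline at all.
    return None


def anchor_elo(elo, anchor_value=1000.0):
    """Shift all ratings so the reference solver (or the mean) sits at anchor_value."""
    ref = pick_anchor(list(elo))
    if ref is None:
        offset = anchor_value - sum(elo.values()) / len(elo)
    else:
        offset = anchor_value - elo[ref]
    return {name: rating + offset for name, rating in elo.items()}


# Hypothetical ratings before anchoring.
raw = {"ModelA": 1100.0, "Seasonal Naive": 950.0, "Naive": 900.0}
anchored = anchor_elo(raw)
print(anchored)
```

Because the shift is a single additive offset, rating gaps between solvers (and therefore expected win rates) are unchanged by the choice of anchor.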
Adds a synthetic objective_column value "global" to the Critical Difference Diagram dropdown. When selected:
- For every numeric objective_* metric M, build per-dataset ranks (respecting HIGHER_IS_BETTER direction)
- Intersect the solver set across metrics
- Concatenate the (k × N_M) rank matrices along axis 1: each (metric, dataset) pair becomes one Friedman block
- Run Friedman + Nemenyi + Demšar CD on the stacked rank matrix
Injection is via overriding BasePlot._get_all_plots so the "global" entry appears in both the dropdown options list and the plots-by-key cache that the HTML frontend reads. On FEV (k=21, 7 metrics, 91 datasets) this yields 637 blocks and CD≈1.24, roughly 2.6x (a factor of √7) tighter than any single-metric CD, because CD scales as 1/√N with the block count N.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
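The stacking step and the Demšar critical difference, CD = q_α · sqrt(k(k+1) / (6N)), can be sketched in pure Python as below. This is an assumption-laden illustration rather than the code in plots/plot_cd_grid.py: blocks are stored as rows (the transpose of the k × N layout above), the helper names and toy scores are hypothetical, and q_α is passed in by the caller because it has to come from a studentized-range table or library.

```python
import math


def rank_row(scores, higher_is_better=False):
    """Average ranks (1 = best) for one block; ties share the mean rank."""
    k = len(scores)
    keyed = [-s if higher_is_better else s for s in scores]
    order = sorted(range(k), key=lambda i: keyed[i])
    ranks = [0.0] * k
    pos = 0
    while pos < k:
        end = pos
        while end + 1 < k and keyed[order[end + 1]] == keyed[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for t in range(pos, end + 1):
            ranks[order[t]] = shared
        pos = end + 1
    return ranks


def stacked_mean_ranks(metrics):
    """Stack every (metric, dataset) pair as one Friedman block.

    metrics maps a metric name to (higher_is_better, rows), where each row
    holds one score per solver, in the same solver order for every metric.
    """
    blocks = []
    for higher_is_better, rows in metrics.values():
        blocks.extend(rank_row(row, higher_is_better) for row in rows)
    n_blocks = len(blocks)
    k = len(blocks[0])
    mean_ranks = [sum(b[i] for b in blocks) / n_blocks for i in range(k)]
    return mean_ranks, n_blocks


def critical_difference(q_alpha, k, n_blocks):
    """Demšar (2006) CD for k solvers over n_blocks blocks."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_blocks))


# Toy data: 3 solvers, one lower-is-better and one higher-is-better metric.
metrics = {
    "objective_error": (False, [[1.0, 2.0, 3.0], [1.5, 1.2, 2.0]]),
    "objective_score": (True, [[0.9, 0.8, 0.8]]),
}
mean_ranks, n_blocks = stacked_mean_ranks(metrics)
print(mean_ranks, n_blocks)

# Same k and q_alpha, 7x more blocks: the CD shrinks by sqrt(7).
ratio = critical_difference(1.0, 21, 91) / critical_difference(1.0, 21, 637)
print(ratio)
```

Note that stacking treats each (metric, dataset) pair as an independent block; correlated metrics would make the resulting CD optimistic, which is worth keeping in mind when reading the "global" diagram.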
Summary
Three additions to make this benchmark useful for cross-comparison with the broader TSFM benchmarking ecosystem:
- `scripts/import_fev_bench.py`: downloads upstream FEV-bench CSV results from `autogluon/fev` and rewrites them as a single benchopt-schema parquet (21 solvers × 100 forecasting tasks). Idempotent caching; GitHub Contents API discovery with a fallback list; CLI flags for `--out` / `--cache-dir` / `--no-download` / `--force`.
- `plots/plot_cd_grid.py`: Demšar (2006) Critical Difference diagrams. Two front-ends share the same Friedman + Nemenyi computation:
  - The standalone CLI renders every `objective_*` column into a single subplot grid PNG.
  - The `Plot(BasePlot)` class registers the CD diagram as a new "Chart type" in the per-run HTML report, with the `objective_column` selector switching metrics in place.
- `plots/elo.py`: Elo leaderboard as a benchopt table-type custom plot. Bradley-Terry MLE rating fit with bootstrap 95% CI; methodology follows TabArena (Erickson et al. 2025, arXiv:2506.16791), which itself follows Chatbot Arena (Chiang et al. 2024). 400-Elo gap = 91% expected win rate; mean Elo anchored at 1000.
- `scripts/README.md`: usage for the importer + CD diagram script, schema mapping for the importer, notes on the greedy-biclique trim for sparse benchmarks.

Once imported, FEV results render alongside native runs:
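    python scripts/import_fev_bench.py
    benchopt plot . --html --no-display -f outputs/fev_bench_results.parquet

The HTML "Chart type" dropdown then offers the two custom plots, the Critical Difference diagram and the Elo leaderboard, alongside the built-in chart types.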
Headline result on FEV: `objective_sql`

Both plots arrive at the same scientific conclusion, that the top-3 foundation models form a statistically indistinguishable cluster, but expose it through different visual encodings.
Test plan
- `python scripts/import_fev_bench.py --no-download` on cached CSVs produces 2,085 rows × 21 solvers × 100 tasks
- `python scripts/import_fev_bench.py` (live fetch) discovers models via GitHub API and downloads in parallel
- `benchopt plot . --html --no-display` succeeds with both custom Plot classes in place
- `objective_column` selector switches metrics in place for both custom plots; tables/diagrams re-render correctly
- `python plots/plot_cd_grid.py outputs/fev_bench_results.parquet`

🤖 Generated with Claude Code