ENH: Support for GiftEval and FEV-Bench by eddardd · Pull Request #17 · benchopt/benchmark_tsfm

eddardd · 2026-05-29T09:02:20Z

No description provided.

Refactors the forecasting ``predict`` contract from a per-context call into a single call covering all series and all rolling cutoffs at once. This matches how hosted-API solvers (TFC, etc.) natively dispatch work and avoids re-paying per-call overhead in the objective.

New signature::

    predict(
        x: list[np.ndarray (T_i, C)],
        cutoff_indexes: list[list[int]],
        covariates: {"static_covars": list, "hist_covars": list, "future_covars": list},
        horizon: int,
    ) -> list[np.ndarray (n_cutoffs_i, horizon, C)]

The ``covariates`` dict always has all three keys (empty lists when a dataset doesn't carry them), so adapters never branch on None vs dict.

Changes:

- ``benchmark_utils/adapters/base.py``: rewrite the predict contract and documentation.
- ``benchmark_utils/windowing.py``: ``make_forecasting_splits`` now returns ``(series_full, cutoff_indexes, targets)`` with targets of shape ``(n_cutoffs_i, H, C)``.
- ``datasets/monash.py``: emits ``cutoff_indexes`` and empty ``covariates`` alongside the existing fields.
- ``objective.py``: forwards the new fields and reshapes the batched prediction back to flat ``(n_total, H, C)`` arrays for metric computation. ``get_one_result`` updated accordingly.
- ``solvers/naive.py``: ``_NaiveForecaster`` takes the batched API (and no longer needs ``prediction_length`` in its constructor).
- ``solvers/chronos.py``: ``_ChronosForecaster`` takes the batched API and reuses the loaded pipeline across all series and cutoffs.
- ``benchmark_utils/adapters/forecast_residual.py``: rewritten as a single batched call so AD scoring is one prediction per series rather than O(T) per series.
- ``solvers/tfc_api.py``: new solver that wraps the TFC hosted-API SDK. Uses ``client.cross_validate`` to issue one request per series with all cutoffs at once. Knobs for ``model``, ``context``, ``add_holidays``, ``add_events``, ``country_isocode``, ``batch_size``. Skips when ``TFC_API_KEY`` is unset.

Verification on Monash[m1_yearly_dataset, debug=True], -j 1:

| solver                | MAE        | MSE         | MASE   | sMAPE  |
| --------------------- | ---------- | ----------- | ------ | ------ |
| Naive[seasonality=1]  | 3,399,506  | 5.93e13     | 12.86  | 0.431  |
| TFC-API[chronos-2]    | 2,807,424  | 4.07e13     | 10.62  | 0.349  |
| TFC-API[tabpfn-ts]    | 2,621,979  | 3.69e13     | 9.92   | 0.401  |
| TFC-API[timesfm-2p5]  | 2,657,678  | 3.99e13     | 10.05  | 0.263  |

The Chronos numbers match bit-for-bit against the pre-refactor run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
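To make the batched contract above concrete, here is a minimal sketch of a last-value forecaster that follows the same signature. It is not the ``_NaiveForecaster`` from this PR: ``naive_predict`` is a hypothetical name, and the sketch assumes each cutoff index is exclusive (the context is ``series[:cutoff]``), which the commit message does not state.

```python
import numpy as np


def naive_predict(x, cutoff_indexes, covariates, horizon):
    """Last-value forecaster following the batched predict signature.

    x: list of arrays, each of shape (T_i, C)
    cutoff_indexes: one list of cutoffs per series
    covariates: dict with all three keys present (unused by this baseline)
    Returns one array of shape (n_cutoffs_i, horizon, C) per series.
    """
    outputs = []
    for series, cutoffs in zip(x, cutoff_indexes):
        per_cutoff = []
        for cutoff in cutoffs:
            # Assumption: the cutoff is exclusive, so the last observed
            # value of the context series[:cutoff] is series[cutoff - 1].
            last = series[cutoff - 1]                        # shape (C,)
            per_cutoff.append(np.tile(last, (horizon, 1)))   # shape (horizon, C)
        outputs.append(np.stack(per_cutoff))
    return outputs


# Two series with different lengths, channel counts, and cutoff counts.
x = [np.arange(10.0).reshape(10, 1), np.arange(16.0).reshape(8, 2)]
cutoff_indexes = [[6, 8], [5]]
covariates = {"static_covars": [], "hist_covars": [], "future_covars": []}

preds = naive_predict(x, cutoff_indexes, covariates, horizon=2)
print([p.shape for p in preds])
```

One call covers every series and every cutoff, so a hosted-API solver could replace the inner loops with a single request while keeping the same return shape.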

Repeats the last ``season_length`` observations to fill the horizon. Default parameter sweep covers ``[1, 7, 12, 24]`` (last-value persistence, weekly, monthly, daily seasonal periods). Useful as a calibrated baseline whose strength depends entirely on matching the seasonal period to the data; handy for sanity-checking the impact of seasonality on TSFMs at fixed compute.

Verified on Monash[m1_yearly_dataset, debug=True]:

| season_length | MAE        | MSE       | MASE   | sMAPE |
| ------------- | ---------- | --------- | ------ | ----- |
| 1             | 3,399,506  | 5.93e13   | 12.86  | 0.431 |
| 7             | 3,045,677  | 4.31e13   | 11.52  | 0.573 |
| 12            | 4,526,063  | 9.24e13   | 17.12  | 0.948 |
| 24            | 6,230,975  | 1.71e14   | 23.56  | 1.744 |

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
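The repetition rule described above can be sketched for a single context window as follows. ``seasonal_naive`` is a hypothetical helper, not the solver in this PR, and falling back to the whole context when it is shorter than ``season_length`` is an assumption.

```python
import numpy as np


def seasonal_naive(context, season_length, horizon):
    """Tile the last `season_length` rows of `context` over `horizon` steps.

    context: array of shape (T, C). Returns an array of shape (horizon, C).
    """
    # Slicing clamps at the start of the array, so a context shorter than
    # season_length simply contributes all of its rows (assumed behaviour).
    season = context[-season_length:]
    reps = -(-horizon // len(season))          # ceiling division
    return np.tile(season, (reps, 1))[:horizon]


context = np.arange(6.0).reshape(6, 1)         # values 0, 1, ..., 5
seasonal = seasonal_naive(context, season_length=3, horizon=5)
persistence = seasonal_naive(context, season_length=1, horizon=5)
print(seasonal.ravel(), persistence.ravel())
```

With ``season_length=1`` the rule reduces to last-value persistence, consistent with the ``season_length = 1`` row matching the Naive baseline.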

Two changes bundled because they touch the same predict signature:

1. Rename ``horizon`` → ``prediction_length`` in the forecasting predict() contract for consistency with the SDK and dataset metadata.
2. TFC-API solver now sends one ``cross_validate`` call covering every series when the model reports ``supports_batching == True`` (chronos-2, moirai-2, T0-1535, T0-1638). Series are aligned to share an end date so all cutoffs collapse to a common ``fcds`` list; the SDK then stacks them into the (V, T) tensor Chronos-2 wants, with one ``unique_id`` per series-channel acting as the group id. Falls back to the per-series loop when cutoff offsets from end aren't homogeneous across series (e.g. a mix of n_windows after some series were filtered for being too short).

Touched files for the rename: base.py, objective.py, forecast_residual.py, naive.py, chronos.py, seasonal_naive.py, tfc_api.py.

Verification on Monash[m1_yearly_dataset, debug=True], -j 1:

- chronos-2 (batched): MAE 2,785,573 · MASE 10.53 · sMAPE 0.348 (vs per-series: MAE 2,807,424 · MASE 10.62; same order, the ~0.8% delta is just batched-vs-sequential sampling variance.)
- timesfm-2p5 (per-series, not batching-capable): unchanged at MAE 2,657,678 · MASE 10.05.

Routing verified directly:

- Chronos_2.supports_batching == True → batched path
- Moirai2.supports_batching == True → batched path
- TimesFM_2p5.supports_batching == False → per-series path
- TabPFN_TS.supports_batching == False → per-series path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
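The routing rule above (batch only when the model supports it and every series has the same cutoff offsets from its end) could be checked along these lines. Both function names are hypothetical, and the sketch takes plain series lengths rather than the solver's actual inputs.

```python
def offsets_from_end(lengths, cutoff_indexes):
    """Express each series' cutoffs as distances from the end of that series."""
    return [
        tuple(length - cutoff for cutoff in cutoffs)
        for length, cutoffs in zip(lengths, cutoff_indexes)
    ]


def use_batched_path(lengths, cutoff_indexes, supports_batching):
    """Return True when a single batched request can cover every series."""
    if not supports_batching:
        return False
    # Once series are aligned on a shared end date, identical offsets mean
    # all series share the same list of forecast cutoffs.
    return len(set(offsets_from_end(lengths, cutoff_indexes))) <= 1


lengths = [10, 8]
homogeneous = use_batched_path(lengths, [[6, 8], [4, 6]], supports_batching=True)
mixed = use_batched_path(lengths, [[6, 8], [6]], supports_batching=True)
unsupported = use_batched_path(lengths, [[6, 8], [4, 6]], supports_batching=False)
print(homogeneous, mixed, unsupported)
```

The ``mixed`` case mirrors the fallback described in the message: one series has fewer windows, so the offsets differ and the per-series loop is used instead.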

… at dataset level

Tightens the forecasting predict() contract introduced earlier in this PR:

- New ``benchmark_utils.inputs.ForecastInput`` frozen dataclass bundles ``x``, ``cutoff_indexes``, and ``covariates``. The base ``predict`` signature is now ``predict(self, x: ForecastInput | np.ndarray)``: forecasting adapters take the dataclass, classification / anomaly-detection adapters take a plain ndarray. No more ``*args/**kwargs``.
- New ``benchmark_utils.covariates.Covariates`` frozen dataclass with ``static_covars / hist_covars / future_covars`` fields, each defaulting to an empty ``Sequence`` (so arrays work as well as lists).
- ``prediction_length`` is removed from the predict signature. It is dataset-level state: the solver reads it from ``meta`` once and wires it into the adapter constructor. This keeps predict() pure per-call.

Updated to the new contract: base adapter, objective (both ``_eval_forecasting`` and ``get_one_result``'s constant adapter), Monash dataset (now emits ``Covariates()``), Naive, SeasonalNaive, Chronos, ForecastResidual, TFC-API.

Parity preserved on Monash[m1_yearly_dataset, debug=True]:

- Naive[seasonality=1]: MAE 3,399,506 · MASE 12.86 · sMAPE 0.431
- SeasonalNaive[season_length=1]: identical to Naive[seasonality=1] ✓
- TFC-API[chronos-2] (batched): MAE 2,785,573 · MASE 10.53
- TFC-API[timesfm-2p5]: MAE 2,657,678 · MASE 10.05

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
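A rough sketch of what the two frozen dataclasses and a constructor-configured adapter might look like. The field names come from the commit message; the empty-tuple defaults, the ``NaiveAdapter`` class, and the exclusive-cutoff convention are assumptions and may differ from the code in this PR.

```python
import dataclasses
from dataclasses import dataclass, field
from typing import Sequence

import numpy as np


@dataclass(frozen=True)
class Covariates:
    # Assumed default: an empty tuple stands in for "an empty Sequence".
    static_covars: Sequence = ()
    hist_covars: Sequence = ()
    future_covars: Sequence = ()


@dataclass(frozen=True)
class ForecastInput:
    x: Sequence[np.ndarray]                  # one (T_i, C) array per series
    cutoff_indexes: Sequence[Sequence[int]]  # one list of cutoffs per series
    covariates: Covariates = field(default_factory=Covariates)


class NaiveAdapter:
    """Hypothetical adapter: prediction_length is set once at construction."""

    def __init__(self, prediction_length):
        self.prediction_length = prediction_length

    def predict(self, x: ForecastInput):
        outputs = []
        for series, cutoffs in zip(x.x, x.cutoff_indexes):
            # Assumption: cutoffs are exclusive, so series[c - 1] is the
            # last observed value before each forecast window.
            rows = [np.tile(series[c - 1], (self.prediction_length, 1)) for c in cutoffs]
            outputs.append(np.stack(rows))
        return outputs


inp = ForecastInput(x=[np.zeros((4, 1))], cutoff_indexes=[[2, 4]])
out = NaiveAdapter(prediction_length=3).predict(inp)
print(out[0].shape)
```

Freezing the dataclasses means an adapter cannot mutate the shared input, which fits the goal of keeping predict() pure per call.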

Forecasting predict() now returns ``Sequence[ForecastOutput]`` instead of a list of raw point arrays. ``ForecastOutput`` is a frozen dataclass holding:

- ``quantiles``: ndarray with shape ``(n_cutoffs, Q, prediction_length, C)``.
- ``quantile_levels``: tuple of floats in (0, 1), length Q.

Point forecasters (Naive, SeasonalNaive, Chronos) set ``quantile_levels=(0.5,)`` and Q=1. The TFC-API adapter now discovers every ``<model>_q{level}`` column the SDK returns and stacks them into ``quantiles`` with the matching ``quantile_levels`` tuple, falling back to the mean column when no quantile columns are present.

``ForecastOutput.point`` returns the best point estimate for metric computation: the median when present, otherwise the mean across quantile levels. The objective uses that property in ``_eval_forecasting``.

Adapter contract update in ``base.py`` docstring. ``forecast_residual`` extracts ``.point`` from the wrapped forecaster.

Verified on Monash[m1_yearly_dataset, debug=True]: Naive, SeasonalNaive, TFC-API[chronos-2] and TFC-API[timesfm-2p5] all match their previous metrics exactly, confirming the median extraction preserves parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
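The ``.point`` rule described above (median when present, otherwise the mean across quantile levels) might be implemented roughly as below. This is a sketch of the per-series variant from this commit, not the class in the PR; matching the median with ``np.isclose`` is an assumption.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ForecastOutput:
    quantiles: np.ndarray     # shape (n_cutoffs, Q, prediction_length, C)
    quantile_levels: tuple    # Q floats in (0, 1)

    @property
    def point(self):
        """Return a point forecast of shape (n_cutoffs, prediction_length, C)."""
        levels = np.asarray(self.quantile_levels, dtype=float)
        median_idx = np.flatnonzero(np.isclose(levels, 0.5))
        if median_idx.size:
            # Median present: take that slice along the quantile axis.
            return self.quantiles[:, median_idx[0]]
        # No median: average over the quantile axis.
        return self.quantiles.mean(axis=1)


# One cutoff, three quantile levels, two forecast steps, one channel.
q = np.array([1.0, 2.0, 4.0]).reshape(1, 3, 1, 1) * np.ones((1, 3, 2, 1))
with_median = ForecastOutput(q, (0.1, 0.5, 0.9)).point
without_median = ForecastOutput(q[:, [0, 2]], (0.1, 0.9)).point
print(with_median.ravel(), without_median.ravel())
```

A point forecaster with ``quantile_levels=(0.5,)`` and Q=1 passes through unchanged, which is consistent with the parity reported above.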

…inference

- ``ForecastOutput`` is now a single dataclass per ``predict()`` call, not a per-series sequence. Its ``quantiles`` field is a ``Sequence[np.ndarray]`` aligned with the input series, each entry shape ``(n_cutoffs_i, Q, prediction_length, C)``. The ``quantile_levels`` tuple is shared across the batch. ``.point`` returns one ndarray per series.
- Adapter signature is now ``predict(self, x: ForecastInput) -> ForecastOutput``, with that return type explicit on every forecasting predict() in the codebase.
- The local Chronos solver is now Chronos-2 (matching the upstream migration on origin/main). The forecaster batches every (series, cutoff) pair into one ``Chronos2Pipeline.predict`` call, with variable context lengths handled by the pipeline's left-padding, and returns the model's full 9-level quantile fan.
- Updated all forecasting solvers + the constant adapter in ``get_one_result`` + ``ForecastResidualAdapter`` to the new contract.

Parity verified on Monash[m1_yearly_dataset, debug=True]: Naive, SeasonalNaive, TFC-API[chronos-2], TFC-API[timesfm-2p5] match their prior metrics exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
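To illustrate the batching idea above, the sketch below flattens every (series, cutoff) pair into one batch, left-pads the contexts to a common length, and regroups a flat result per series. It does not call ``Chronos2Pipeline`` (per the message, the pipeline does its own padding). All helper names, the NaN pad value, the exclusive cutoffs, and the shared channel count are assumptions for illustration only.

```python
import numpy as np


def flatten_contexts(x, cutoff_indexes):
    """Return one context per (series, cutoff) pair plus the owning series index."""
    contexts, owners = [], []
    for i, (series, cutoffs) in enumerate(zip(x, cutoff_indexes)):
        for cutoff in cutoffs:
            contexts.append(series[:cutoff])   # assumed exclusive cutoff
            owners.append(i)
    return contexts, owners


def left_pad(contexts, pad_value=np.nan):
    """Stack contexts into (N, max_len, C), padding on the left."""
    max_len = max(len(c) for c in contexts)
    n_channels = contexts[0].shape[1]          # assumes a shared C
    batch = np.full((len(contexts), max_len, n_channels), pad_value)
    for row, ctx in enumerate(contexts):
        batch[row, max_len - len(ctx):] = ctx  # newest values stay right-aligned
    return batch


def regroup(flat, owners, n_series):
    """Split a flat (N, ...) array back into one array per input series."""
    return [flat[[j for j, o in enumerate(owners) if o == i]] for i in range(n_series)]


x = [np.arange(5.0).reshape(5, 1), np.arange(3.0).reshape(3, 1)]
contexts, owners = flatten_contexts(x, [[3, 5], [2]])
batch = left_pad(contexts)
per_series = regroup(batch, owners, n_series=len(x))
print(batch.shape, [p.shape for p in per_series])
```

Regrouping by owner index is what lets a single batched call still return one entry per input series, as the new ``quantiles`` field requires.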

…ict-api # Conflicts: # objective.py # solvers/chronos.py

- ``>=2.0`` was too loose: ``Chronos2Pipeline.predict`` with a variable-length list of tensors and the ``pipeline.quantiles`` attribute stabilized in 2.2.x (the version verified end-to-end here). Switch to ``>=2.2,<3`` so we test what we ship and a future major bump can't silently break the contract.
- Drop ``pip::torch``: ``chronos-forecasting`` already pins ``torch<3,>=2.2`` transitively, so listing it again is dead weight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Batched forecasting API + TFC-API + SeasonalNaive + Chronos-2 batching

tomMoral

A few comments

GeoffNN and others added 15 commits May 28, 2026 16:27

Merge remote-tracking branch 'origin/main' into refactor/batched-pred…

d910ba5

Merge pull request #1 from GeoffNN/refactor/batched-predict-api

70a2873

Merge branch 'main' into main

5c1876a

feat: move constants to a dedicated file

1829e41

feat: gift evall support

6689046

Merge branch 'main' into feat/gift-eval-support

5e4a675

feat: adds support for fev bench

2a4a740

minor fixes

848effb

eddardd changed the title ~~Feat/gift eval support~~ ENH: Support for GiftEval and FEV-Bench May 29, 2026

tomMoral reviewed May 29, 2026

Comment thread datasets/gifteval.py

Comment thread datasets/gifteval.py Outdated

fixes, prepare(), all behavior for gifteval and fevbench

b2dc953

eddardd wants to merge 16 commits into benchopt:main from eddardd:feat/gift-eval-support

3 participants