
ENH: Support for GiftEval and FEV-Bench #17

Open
eddardd wants to merge 16 commits into benchopt:main from eddardd:feat/gift-eval-support


@eddardd (Contributor) commented on May 29, 2026

No description provided.

GeoffNN and others added 15 commits May 28, 2026 16:27
Refactors the forecasting ``predict`` contract from a per-context call
into a single call covering all series and all rolling cutoffs at once.
This matches how hosted-API solvers (TFC, etc.) natively dispatch work
and avoids re-paying per-call overhead in the objective.

New signature::

    predict(
        x: list[np.ndarray (T_i, C)],
        cutoff_indexes: list[list[int]],
        covariates: {"static_covars": list, "hist_covars": list,
                     "future_covars": list},
        horizon: int,
    ) -> list[np.ndarray (n_cutoffs_i, horizon, C)]

The ``covariates`` dict always has all three keys (empty lists when a
dataset doesn't carry them), so adapters never branch on None vs dict.
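A minimal sketch of an adapter honouring this batched contract, using last-value persistence. It assumes that a cutoff index ``c`` means the context is ``series[:c]``; that convention and the function name ``naive_predict`` are illustrative assumptions, not taken from the PR.

```python
import numpy as np


def naive_predict(x, cutoff_indexes, covariates, horizon):
    """Last-value forecast for every series and every rolling cutoff."""
    outputs = []
    for series, cutoffs in zip(x, cutoff_indexes):
        # Assumption: cutoff c means the model sees series[:c] as context.
        last_values = np.stack([series[c - 1] for c in cutoffs])  # (n_cutoffs_i, C)
        # Repeat the last observed row along a new horizon axis.
        outputs.append(np.repeat(last_values[:, None, :], horizon, axis=1))
    return outputs


x = [np.arange(10.0).reshape(10, 1), np.arange(12.0).reshape(6, 2)]
cutoff_indexes = [[6, 8], [4]]
# All three keys are always present, so no None-vs-dict branching is needed.
covariates = {"static_covars": [], "hist_covars": [], "future_covars": []}
preds = naive_predict(x, cutoff_indexes, covariates, horizon=2)
print([p.shape for p in preds])
```

Each returned array has shape ``(n_cutoffs_i, horizon, C)``, one per input series.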

Changes:
- ``benchmark_utils/adapters/base.py``: rewrite the predict contract and
  documentation.
- ``benchmark_utils/windowing.py``: ``make_forecasting_splits`` now
  returns ``(series_full, cutoff_indexes, targets)`` with targets of
  shape ``(n_cutoffs_i, H, C)``.
- ``datasets/monash.py``: emits ``cutoff_indexes`` and empty
  ``covariates`` alongside the existing fields.
- ``objective.py``: forwards the new fields and reshapes the batched
  prediction back to flat ``(n_total, H, C)`` arrays for metric
  computation. ``get_one_result`` updated accordingly.
- ``solvers/naive.py``: ``_NaiveForecaster`` takes the batched API
  (and no longer needs ``prediction_length`` in its constructor).
- ``solvers/chronos.py``: ``_ChronosForecaster`` takes the batched API
  and reuses the loaded pipeline across all series and cutoffs.
- ``benchmark_utils/adapters/forecast_residual.py``: rewritten as a
  single batched call so AD scoring is one prediction per series rather
  than O(T) per series.
- ``solvers/tfc_api.py``: new solver that wraps the TFC hosted-API SDK.
  Uses ``client.cross_validate`` to issue one request per series with
  all cutoffs at once. Knobs for ``model``, ``context``,
  ``add_holidays``, ``add_events``, ``country_isocode``, ``batch_size``.
  Skips when ``TFC_API_KEY`` is unset.
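As a rough illustration of the objective-side reshaping described above, per-series arrays can be concatenated along the cutoff axis before computing a metric. The helper name ``flatten_and_mae`` is hypothetical, and MAE is used only as a familiar example metric.

```python
import numpy as np


def flatten_and_mae(preds, targets):
    """Concatenate per-series arrays along the cutoff axis, then compute MAE."""
    flat_pred = np.concatenate(preds, axis=0)    # (n_total, H, C)
    flat_true = np.concatenate(targets, axis=0)  # (n_total, H, C)
    mae = float(np.mean(np.abs(flat_pred - flat_true)))
    return flat_pred.shape, mae


# Two series: the first has 2 cutoffs, the second has 1 (H=3, C=1).
preds = [np.zeros((2, 3, 1)), np.ones((1, 3, 1))]
targets = [np.ones((2, 3, 1)), np.ones((1, 3, 1))]
shape, mae = flatten_and_mae(preds, targets)
print(shape, mae)
```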

Verification — Monash[m1_yearly_dataset, debug=True], -j 1:

| solver                | MAE        | MSE         | MASE   | sMAPE  |
| --------------------- | ---------- | ----------- | ------ | ------ |
| Naive[seasonality=1]  | 3,399,506  | 5.93e13     | 12.86  | 0.431  |
| TFC-API[chronos-2]    | 2,807,424  | 4.07e13     | 10.62  | 0.349  |
| TFC-API[tabpfn-ts]    | 2,621,979  | 3.69e13     |  9.92  | 0.401  |
| TFC-API[timesfm-2p5]  | 2,657,678  | 3.99e13     | 10.05  | 0.263  |

The Chronos numbers match bit-for-bit against the pre-refactor run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SeasonalNaive solver repeats the last ``season_length`` observations
to fill the horizon. The default parameter sweep covers ``[1, 7, 12, 24]``
(last-value persistence, then weekly, monthly, and daily seasonal periods).

Useful as a calibrated baseline whose strength depends entirely on
matching the seasonal period to the data — handy for sanity-checking
the impact of seasonality on TSFMs at fixed compute.
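The idea can be sketched in a few lines of NumPy. This is an illustrative version, not the code in ``solvers/seasonal_naive.py``; the function name and the handling of contexts shorter than one season are assumptions.

```python
import numpy as np


def seasonal_naive(context, season_length, horizon):
    """Tile the last `season_length` rows of `context` (T, C) over the horizon."""
    # Assumption: if the context is shorter than one season, use what exists.
    season = context[-season_length:]
    reps = -(-horizon // len(season))  # ceiling division
    return np.tile(season, (reps, 1))[:horizon]


context = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
persistence = seasonal_naive(context, season_length=1, horizon=3)
two_step = seasonal_naive(context, season_length=2, horizon=5)
print(persistence.ravel(), two_step.ravel())
```

With ``season_length=1`` this reduces to last-value persistence, which is why it should agree with the plain Naive baseline.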

Verified on Monash[m1_yearly_dataset, debug=True]:

| season_length | MAE        | MSE       | MASE   | sMAPE |
| ------------- | ---------- | --------- | ------ | ----- |
|             1 | 3,399,506  | 5.93e13   | 12.86  | 0.431 |
|             7 | 3,045,677  | 4.31e13   | 11.52  | 0.573 |
|            12 | 4,526,063  | 9.24e13   | 17.12  | 0.948 |
|            24 | 6,230,975  | 1.71e14   | 23.56  | 1.744 |

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes bundled because they touch the same predict signature:

1. Rename ``horizon`` → ``prediction_length`` in the forecasting
   predict() contract for consistency with the SDK and dataset metadata.

2. TFC-API solver now sends one ``cross_validate`` call covering every
   series when the model reports ``supports_batching == True``
   (chronos-2, moirai-2, T0-1535, T0-1638). Series are aligned to share
   an end date so all cutoffs collapse to a common ``fcds`` list; the
   SDK then stacks them into the (V, T) tensor Chronos-2 wants, with
   one ``unique_id`` per series-channel acting as the group id.

   Falls back to the per-series loop when the cutoff offsets from the
   series end differ across series (e.g. a mix of ``n_windows`` values
   after some series were filtered for being too short).
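The routing condition in item 2 can be sketched as a check on cutoff offsets measured from each series end. The helper below is hypothetical and does not touch the SDK; it only shows when a single shared cutoff list would exist.

```python
def common_offsets_from_end(series_lengths, cutoff_indexes):
    """Return the shared cutoff offsets from the series end, or None if they differ."""
    offsets = [
        tuple(length - c for c in cutoffs)
        for length, cutoffs in zip(series_lengths, cutoff_indexes)
    ]
    if offsets and all(o == offsets[0] for o in offsets):
        return offsets[0]  # homogeneous: one batched request is possible
    return None  # heterogeneous: fall back to the per-series loop


# Same offsets (4 and 2 steps before the end) despite different lengths.
batched = common_offsets_from_end([10, 12], [[6, 8], [8, 10]])
# The second series has only one window, so the offsets differ.
fallback = common_offsets_from_end([10, 7], [[6, 8], [5]])
print(batched, fallback)
```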

Touched files for the rename: base.py, objective.py,
forecast_residual.py, naive.py, chronos.py, seasonal_naive.py, tfc_api.py.

Verification — Monash[m1_yearly_dataset, debug=True], -j 1:

- chronos-2 (batched): MAE 2,785,573 · MASE 10.53 · sMAPE 0.348
  (vs per-series: MAE 2,807,424 · MASE 10.62; same order of magnitude,
   and the ~0.8% MAE delta is attributed to batched-vs-sequential
   sampling variance.)
- timesfm-2p5 (per-series, not batching-capable): unchanged at
  MAE 2,657,678 · MASE 10.05.

Routing verified directly:
- Chronos_2.supports_batching == True  → batched path
- Moirai2.supports_batching == True    → batched path
- TimesFM_2p5.supports_batching == False → per-series path
- TabPFN_TS.supports_batching == False → per-series path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… at dataset level

Tightens the forecasting predict() contract introduced earlier in this
PR:

- New ``benchmark_utils.inputs.ForecastInput`` frozen dataclass bundles
  ``x``, ``cutoff_indexes``, and ``covariates``. The base ``predict``
  signature is now ``predict(self, x: ForecastInput | np.ndarray)`` —
  forecasting adapters take the dataclass, classification / anomaly-
  detection adapters take a plain ndarray. No more ``*args/**kwargs``.

- New ``benchmark_utils.covariates.Covariates`` frozen dataclass with
  ``static_covars / hist_covars / future_covars`` fields, each defaulting
  to an empty ``Sequence`` (so arrays work as well as lists).

- ``prediction_length`` is removed from the predict signature. It is
  dataset-level state — the solver reads it from ``meta`` once and wires
  it into the adapter constructor. This keeps predict() pure per-call.
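A sketch of what the two frozen dataclasses might look like, based only on the field names given above. The empty-tuple defaults and the default for ``covariates`` on ``ForecastInput`` are assumptions; the real definitions in ``benchmark_utils`` may differ.

```python
from dataclasses import dataclass, field
from typing import Sequence

import numpy as np


@dataclass(frozen=True)
class Covariates:
    # Assumption: empty tuples stand in for "an empty Sequence".
    static_covars: Sequence = ()
    hist_covars: Sequence = ()
    future_covars: Sequence = ()


@dataclass(frozen=True)
class ForecastInput:
    x: Sequence[np.ndarray]                  # one (T_i, C) array per series
    cutoff_indexes: Sequence[Sequence[int]]  # rolling cutoffs per series
    # Assumption: covariates default to an empty Covariates instance.
    covariates: Covariates = field(default_factory=Covariates)


inp = ForecastInput(x=[np.zeros((5, 1))], cutoff_indexes=[[3, 4]])
print(len(inp.x), inp.cutoff_indexes, inp.covariates)
```

Because both classes are frozen, an adapter cannot mutate the input it receives, which fits the goal of keeping ``predict()`` pure per call.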

Updated to the new contract: base adapter, objective (both
``_eval_forecasting`` and ``get_one_result``'s constant adapter), Monash
dataset (now emits ``Covariates()``), Naive, SeasonalNaive, Chronos,
ForecastResidual, TFC-API.

Parity preserved on Monash[m1_yearly_dataset, debug=True]:
- Naive[seasonality=1]:           MAE 3,399,506 · MASE 12.86 · sMAPE 0.431
- SeasonalNaive[season_length=1]: identical to Naive[seasonality=1] ✓
- TFC-API[chronos-2] (batched):   MAE 2,785,573 · MASE 10.53
- TFC-API[timesfm-2p5]:           MAE 2,657,678 · MASE 10.05

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Forecasting predict() now returns ``Sequence[ForecastOutput]`` instead
of a list of raw point arrays. ``ForecastOutput`` is a frozen dataclass
holding:

- ``quantiles``: ndarray with shape ``(n_cutoffs, Q, prediction_length, C)``.
- ``quantile_levels``: tuple of floats in (0, 1), length Q.

Point forecasters (Naive, SeasonalNaive, Chronos) set
``quantile_levels=(0.5,)`` and Q=1. The TFC-API adapter now discovers
every ``<model>_q{level}`` column the SDK returns and stacks them into
``quantiles`` with the matching ``quantile_levels`` tuple — falling
back to the mean column when no quantile columns are present.

``ForecastOutput.point`` returns the best point estimate for metric
computation: the median when present, otherwise the mean across
quantile levels. The objective uses that property in
``_eval_forecasting``.
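The median-or-mean rule for ``.point`` could be written roughly as follows. This is a sketch of the behaviour described in this commit (one ``ForecastOutput`` per series), not the actual class; the tolerance-based match on 0.5 is an assumption.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ForecastOutput:
    quantiles: np.ndarray   # (n_cutoffs, Q, prediction_length, C)
    quantile_levels: tuple  # floats in (0, 1), length Q

    @property
    def point(self):
        # Use the 0.5 quantile when it is among the levels.
        for q, level in enumerate(self.quantile_levels):
            if np.isclose(level, 0.5):
                return self.quantiles[:, q]
        # Otherwise average over the quantile axis.
        return self.quantiles.mean(axis=1)


# Shape (1, 3, 2, 1): three quantile levels with constant values 1, 2, 4.
q3 = np.stack([np.full((1, 2, 1), v) for v in (1.0, 2.0, 4.0)], axis=1)
with_median = ForecastOutput(q3, (0.1, 0.5, 0.9)).point
# Shape (1, 2, 2, 1): no median level, so the mean of 1 and 3 is used.
q2 = np.stack([np.full((1, 2, 1), v) for v in (1.0, 3.0)], axis=1)
without_median = ForecastOutput(q2, (0.1, 0.9)).point
print(with_median.shape, without_median.shape)
```

A point forecaster with ``quantile_levels=(0.5,)`` and Q=1 takes the first branch, so its single slice is returned unchanged.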

The adapter contract is updated in the ``base.py`` docstring, and
``forecast_residual`` extracts ``.point`` from the wrapped forecaster.

Verified on Monash[m1_yearly_dataset, debug=True]: Naive,
SeasonalNaive, TFC-API[chronos-2] and TFC-API[timesfm-2p5] all match
their previous metrics exactly, confirming the median extraction
preserves parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…inference

- ``ForecastOutput`` is now a single dataclass per ``predict()`` call,
  not a per-series sequence. Its ``quantiles`` field is a
  ``Sequence[np.ndarray]`` aligned with the input series, each entry
  shape ``(n_cutoffs_i, Q, prediction_length, C)``. The ``quantile_levels``
  tuple is shared across the batch. ``.point`` returns one ndarray per
  series.
- Adapter signature is now ``predict(self, x: ForecastInput) -> ForecastOutput``,
  with that return type explicit on every forecasting predict() in the
  codebase.
- The local Chronos solver is now Chronos-2 (matching the upstream
  migration on origin/main). The forecaster batches every (series,
  cutoff) pair into one ``Chronos2Pipeline.predict`` call — variable
  context lengths handled by the pipeline's left-padding — and returns
  the model's full 9-level quantile fan.
- Updated all forecasting solvers + the constant adapter in
  ``get_one_result`` + ``ForecastResidualAdapter`` to the new contract.
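To illustrate the left-padding idea mentioned for variable context lengths, here is a generic NumPy sketch for univariate contexts. It is not the pipeline's own code; the NaN pad value and the function name ``left_pad`` are assumptions made for illustration.

```python
import numpy as np


def left_pad(contexts, pad_value=np.nan):
    """Stack 1-D contexts of different lengths into one right-aligned batch."""
    max_len = max(len(c) for c in contexts)
    batch = np.full((len(contexts), max_len), pad_value, dtype=float)
    for i, c in enumerate(contexts):
        # Right-align each context so the most recent value sits in the last column.
        batch[i, max_len - len(c):] = c
    return batch


batch = left_pad([np.array([1.0, 2.0, 3.0]), np.array([4.0])])
print(batch)
```

Right-aligning keeps the forecast origin in the same column for every (series, cutoff) pair, which is what lets them share a single batched call.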

Parity verified on Monash[m1_yearly_dataset, debug=True]: Naive,
SeasonalNaive, TFC-API[chronos-2], TFC-API[timesfm-2p5] match their
prior metrics exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ict-api

# Conflicts:
#	objective.py
#	solvers/chronos.py
- The ``>=2.0`` pin on ``chronos-forecasting`` was too loose: ``Chronos2Pipeline.predict`` with a
  variable-length list of tensors and the ``pipeline.quantiles``
  attribute stabilized in 2.2.x (the version verified end-to-end here).
  Switch to ``>=2.2,<3`` so we test what we ship and a future major
  bump can't silently break the contract.
- Drop ``pip::torch`` — ``chronos-forecasting`` already pins
  ``torch<3,>=2.2`` transitively, so listing it again is dead weight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Batched forecasting API + TFC-API + SeasonalNaive + Chronos-2 batching
@eddardd changed the title from "Feat/gift eval support" to "ENH: Support for GiftEval and FEV-Bench" on May 29, 2026
@tomMoral (Member) left a comment

A few comments

Comment thread datasets/gifteval.py
Comment thread datasets/gifteval.py Outdated

3 participants