ENH: Support for GiftEval and FEV-Bench #17
Open
eddardd wants to merge 16 commits into
Refactors the forecasting ``predict`` contract from a per-context call
into a single call covering all series and all rolling cutoffs at once.
This matches how hosted-API solvers (TFC, etc.) natively dispatch work
and avoids re-paying per-call overhead in the objective.
New signature::
    predict(
        x: list[np.ndarray (T_i, C)],
        cutoff_indexes: list[list[int]],
        covariates: {"static_covars": list, "hist_covars": list,
                     "future_covars": list},
        horizon: int,
    ) -> list[np.ndarray (n_cutoffs_i, horizon, C)]
The ``covariates`` dict always has all three keys (empty lists when a
dataset doesn't carry them), so adapters never branch on None vs dict.
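To make the shapes in the signature above concrete, here is a minimal sketch of a last-value forecaster that follows the batched contract. It is not the code in ``solvers/naive.py``; the function name ``naive_predict`` is hypothetical, and it assumes a cutoff index ``c`` means "forecast from ``series[:c]``", which the description above does not spell out.

```python
import numpy as np


def naive_predict(x, cutoff_indexes, covariates, horizon):
    """Last-value forecaster following the batched contract (illustrative only)."""
    outputs = []
    for series, cutoffs in zip(x, cutoff_indexes):
        # series has shape (T_i, C). Assumption: cutoff c uses series[:c] as context,
        # so the last observed row is series[c - 1].
        windows = [np.repeat(series[c - 1][None, :], horizon, axis=0) for c in cutoffs]
        outputs.append(np.stack(windows))  # (n_cutoffs_i, horizon, C)
    return outputs


# Two series with different lengths, channel counts and numbers of cutoffs.
x = [np.arange(10.0).reshape(10, 1), np.arange(12.0).reshape(6, 2)]
cutoff_indexes = [[6, 8], [4]]
covariates = {"static_covars": [], "hist_covars": [], "future_covars": []}
preds = naive_predict(x, cutoff_indexes, covariates, horizon=2)
```

One call returns one array per series, so the objective never has to loop over contexts itself.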
Changes:
- ``benchmark_utils/adapters/base.py``: rewrite the predict contract and
documentation.
- ``benchmark_utils/windowing.py``: ``make_forecasting_splits`` now
returns ``(series_full, cutoff_indexes, targets)`` with targets of
shape ``(n_cutoffs_i, H, C)``.
- ``datasets/monash.py``: emits ``cutoff_indexes`` and empty
``covariates`` alongside the existing fields.
- ``objective.py``: forwards the new fields and reshapes the batched
prediction back to flat ``(n_total, H, C)`` arrays for metric
computation. ``get_one_result`` updated accordingly.
- ``solvers/naive.py``: ``_NaiveForecaster`` takes the batched API
(and no longer needs ``prediction_length`` in its constructor).
- ``solvers/chronos.py``: ``_ChronosForecaster`` takes the batched API
and reuses the loaded pipeline across all series and cutoffs.
- ``benchmark_utils/adapters/forecast_residual.py``: rewritten as a
single batched call so AD scoring is one prediction per series rather
than O(T) per series.
- ``solvers/tfc_api.py``: new solver that wraps the TFC hosted-API SDK.
Uses ``client.cross_validate`` to issue one request per series with
all cutoffs at once. Knobs for ``model``, ``context``,
``add_holidays``, ``add_events``, ``country_isocode``, ``batch_size``.
Skips when ``TFC_API_KEY`` is unset.
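The reshaping step described for ``objective.py`` in the list above can be sketched as a concatenation along the cutoff axis. This is an illustration only, not the code in the PR: ``flatten_predictions`` is a hypothetical name, and it assumes every series shares the same ``H`` and ``C``.

```python
import numpy as np


def flatten_predictions(per_series):
    """Concatenate per-series (n_cutoffs_i, H, C) arrays into (n_total, H, C)."""
    # Assumption: all arrays agree on H and C, so only the first axis varies.
    return np.concatenate(per_series, axis=0)


per_series = [np.zeros((2, 3, 1)), np.ones((4, 3, 1))]
flat = flatten_predictions(per_series)  # n_total = 2 + 4 windows
```

Targets returned by ``make_forecasting_splits`` have the same per-series layout, so the same flattening would keep predictions and targets aligned row by row.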
Verification — Monash[m1_yearly_dataset, debug=True], -j 1:
| solver | MAE | MSE | MASE | sMAPE |
| --------------------- | ---------- | ----------- | ------ | ------ |
| Naive[seasonality=1] | 3,399,506 | 5.93e13 | 12.86 | 0.431 |
| TFC-API[chronos-2] | 2,807,424 | 4.07e13 | 10.62 | 0.349 |
| TFC-API[tabpfn-ts] | 2,621,979 | 3.69e13 | 9.92 | 0.401 |
| TFC-API[timesfm-2p5] | 2,657,678 | 3.99e13 | 10.05 | 0.263 |
The Chronos numbers match bit-for-bit against the pre-refactor run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repeats the last ``season_length`` observations to fill the horizon.
Default parameter sweep covers ``[1, 7, 12, 24]`` (last-value
persistence, weekly, monthly, daily seasonal periods). Useful as a
calibrated baseline whose strength depends entirely on matching the
seasonal period to the data; handy for sanity-checking the impact of
seasonality on TSFMs at fixed compute.
Verified on Monash[m1_yearly_dataset, debug=True]:
| season_length | MAE        | MSE       | MASE   | sMAPE |
| ------------- | ---------- | --------- | ------ | ----- |
| 1             | 3,399,506  | 5.93e13   | 12.86  | 0.431 |
| 7             | 3,045,677  | 4.31e13   | 11.52  | 0.573 |
| 12            | 4,526,063  | 9.24e13   | 17.12  | 0.948 |
| 24            | 6,230,975  | 1.71e14   | 23.56  | 1.744 |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
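As a rough illustration of "repeats the last ``season_length`` observations to fill the horizon", the tiling for a single context window could look like the sketch below. The function name and signature are hypothetical and the solver in this PR may handle edge cases differently.

```python
import numpy as np


def seasonal_naive(context, season_length, horizon):
    """Tile the last `season_length` rows of `context` (T, C) over the horizon."""
    season = context[-season_length:]             # (season_length, C), or shorter if T is small
    reps = -(-horizon // len(season))             # ceiling division
    return np.tile(season, (reps, 1))[:horizon]   # (horizon, C)


context = np.arange(1.0, 9.0).reshape(8, 1)       # values 1..8, one channel
forecast = seasonal_naive(context, season_length=3, horizon=5)
persistence = seasonal_naive(context, season_length=1, horizon=5)
```

With ``season_length=1`` the sketch reduces to last-value persistence, which is why the first table row coincides with Naive[seasonality=1].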
Two changes bundled because they touch the same predict signature:
1. Rename ``horizon`` → ``prediction_length`` in the forecasting
   predict() contract for consistency with the SDK and dataset metadata.
2. TFC-API solver now sends one ``cross_validate`` call covering every
   series when the model reports ``supports_batching == True``
   (chronos-2, moirai-2, T0-1535, T0-1638). Series are aligned to share
   an end date so all cutoffs collapse to a common ``fcds`` list; the SDK
   then stacks them into the (V, T) tensor Chronos-2 wants, with one
   ``unique_id`` per series-channel acting as the group id. Falls back
   to the per-series loop when cutoff offsets from end aren't
   homogeneous across series (e.g. a mix of n_windows after some series
   were filtered for being too short).
Touched files for the rename: base.py, objective.py,
forecast_residual.py, naive.py, chronos.py, seasonal_naive.py,
tfc_api.py.
Verification on Monash[m1_yearly_dataset, debug=True], -j 1:
- chronos-2 (batched): MAE 2,785,573 · MASE 10.53 · sMAPE 0.348
  (vs per-series: MAE 2,807,424 · MASE 10.62; same order, the ~0.8%
  delta is just batched-vs-sequential sampling variance.)
- timesfm-2p5 (per-series, not batching-capable): unchanged at
  MAE 2,657,678 · MASE 10.05.
Routing verified directly:
- Chronos_2.supports_batching == True → batched path
- Moirai2.supports_batching == True → batched path
- TimesFM_2p5.supports_batching == False → per-series path
- TabPFN_TS.supports_batching == False → per-series path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
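The fallback condition in this commit (cutoff offsets from end must be homogeneous across series) can be sketched as a small check. This is an assumption about how such a check might look, not the logic in ``solvers/tfc_api.py``; ``common_offsets_from_end`` is a hypothetical helper.

```python
def common_offsets_from_end(lengths, cutoff_indexes):
    """Return the shared cutoff offsets from series end, or None if they differ."""
    # Offset of a cutoff = number of steps between the cutoff and the series end.
    offsets = [
        tuple(length - c for c in cutoffs)
        for length, cutoffs in zip(lengths, cutoff_indexes)
    ]
    if offsets and all(o == offsets[0] for o in offsets):
        return offsets[0]
    return None


# Different lengths, but both series have cutoffs 4 and 2 steps before the end.
shared = common_offsets_from_end([10, 20], [[6, 8], [16, 18]])
# The second series lost a window, so the offsets no longer match.
mixed = common_offsets_from_end([10, 20], [[6, 8], [18]])
```

A non-``None`` result would select the single batched request; ``None`` would route to the per-series loop.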
… at dataset level
Tightens the forecasting predict() contract introduced earlier in this
PR:
- New ``benchmark_utils.inputs.ForecastInput`` frozen dataclass bundles
  ``x``, ``cutoff_indexes``, and ``covariates``. The base ``predict``
  signature is now ``predict(self, x: ForecastInput | np.ndarray)``:
  forecasting adapters take the dataclass, classification /
  anomaly-detection adapters take a plain ndarray. No more
  ``*args/**kwargs``.
- New ``benchmark_utils.covariates.Covariates`` frozen dataclass with
  ``static_covars / hist_covars / future_covars`` fields, each
  defaulting to an empty ``Sequence`` (so arrays work as well as lists).
- ``prediction_length`` is removed from the predict signature. It is
  dataset-level state: the solver reads it from ``meta`` once and wires
  it into the adapter constructor. This keeps predict() pure per-call.
Updated to the new contract: base adapter, objective (both
``_eval_forecasting`` and ``get_one_result``'s constant adapter), Monash
dataset (now emits ``Covariates()``), Naive, SeasonalNaive, Chronos,
ForecastResidual, TFC-API.
Parity preserved on Monash[m1_yearly_dataset, debug=True]:
- Naive[seasonality=1]: MAE 3,399,506 · MASE 12.86 · sMAPE 0.431
- SeasonalNaive[season_length=1]: identical to Naive[seasonality=1] ✓
- TFC-API[chronos-2] (batched): MAE 2,785,573 · MASE 10.53
- TFC-API[timesfm-2p5]: MAE 2,657,678 · MASE 10.05
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
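For readers who want to picture the two dataclasses named in this commit, a minimal sketch follows. Field names come from the commit message; the empty-tuple defaults, the default ``covariates`` on ``ForecastInput`` and the type annotations are assumptions and may not match ``benchmark_utils``.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from typing import Sequence


@dataclass(frozen=True)
class Covariates:
    # Assumption: an empty tuple stands in for "an empty Sequence".
    static_covars: Sequence = ()
    hist_covars: Sequence = ()
    future_covars: Sequence = ()


@dataclass(frozen=True)
class ForecastInput:
    x: Sequence                                  # one (T_i, C) array per series
    cutoff_indexes: Sequence[Sequence[int]]      # rolling cutoffs per series
    # Assumption: defaulting to empty covariates; the real class may require it.
    covariates: Covariates = field(default_factory=Covariates)


inp = ForecastInput(x=[[1.0, 2.0, 3.0]], cutoff_indexes=[[2, 3]])

# Frozen dataclasses reject attribute assignment after construction.
try:
    inp.x = []
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because ``prediction_length`` lives on the adapter rather than on ``ForecastInput``, the same input object can be handed to adapters configured for different horizons.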
Forecasting predict() now returns ``Sequence[ForecastOutput]`` instead
of a list of raw point arrays. ``ForecastOutput`` is a frozen dataclass
holding:
- ``quantiles``: ndarray with shape ``(n_cutoffs, Q, prediction_length, C)``.
- ``quantile_levels``: tuple of floats in (0, 1), length Q.
Point forecasters (Naive, SeasonalNaive, Chronos) set
``quantile_levels=(0.5,)`` and Q=1. The TFC-API adapter now discovers
every ``<model>_q{level}`` column the SDK returns and stacks them into
``quantiles`` with the matching ``quantile_levels`` tuple — falling
back to the mean column when no quantile columns are present.
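The column discovery described above can be sketched with a regular expression over the returned column names. The exact column format is an assumption here (levels written as decimals, e.g. ``chronos-2_q0.1``); the SDK may format them differently, and ``find_quantile_columns`` is a hypothetical helper, not part of the SDK or this PR.

```python
import re


def find_quantile_columns(columns, model):
    """Map quantile level -> column name for columns named '<model>_q<level>'."""
    # Assumption: the level is written as a plain decimal such as 0.1 or 0.5.
    pattern = re.compile(rf"^{re.escape(model)}_q(\d*\.?\d+)$")
    found = {}
    for name in columns:
        match = pattern.match(name)
        if match:
            found[float(match.group(1))] = name
    return dict(sorted(found.items()))


columns = ["unique_id", "chronos-2", "chronos-2_q0.9", "chronos-2_q0.1", "chronos-2_q0.5"]
levels = find_quantile_columns(columns, "chronos-2")
quantile_levels = tuple(levels)

# No quantile columns: an empty mapping signals the mean-column fallback.
no_quantiles = find_quantile_columns(["unique_id", "timesfm-2p5"], "timesfm-2p5")
```

Sorting by level keeps the ``Q`` axis of ``quantiles`` in the same order as ``quantile_levels``.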
``ForecastOutput.point`` returns the best point estimate for metric
computation: the median when present, otherwise the mean across
quantile levels. The objective uses that property in
``_eval_forecasting``.
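A minimal sketch of the ``point`` rule described above (median when present, otherwise the mean across quantile levels), assuming the per-series ``(n_cutoffs, Q, prediction_length, C)`` layout from this commit. It illustrates the rule and is not the class from the PR.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ForecastOutput:
    quantiles: np.ndarray      # (n_cutoffs, Q, prediction_length, C)
    quantile_levels: tuple     # floats in (0, 1), length Q

    @property
    def point(self):
        """Median slice if 0.5 is among the levels, else the mean over the Q axis."""
        if 0.5 in self.quantile_levels:
            return self.quantiles[:, self.quantile_levels.index(0.5)]
        return self.quantiles.mean(axis=1)


q = np.arange(12.0).reshape(2, 3, 2, 1)                  # 2 cutoffs, Q=3, length 2, C=1
with_median = ForecastOutput(q, (0.1, 0.5, 0.9))
without_median = ForecastOutput(q[:, [0, 2]], (0.1, 0.9))
```

Both branches return an array of shape ``(n_cutoffs, prediction_length, C)``, so metric code does not need to know how many quantiles a solver produced.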
Adapter contract update in ``base.py`` docstring. ``forecast_residual``
extracts ``.point`` from the wrapped forecaster.
Verified on Monash[m1_yearly_dataset, debug=True]: Naive,
SeasonalNaive, TFC-API[chronos-2] and TFC-API[timesfm-2p5] all match
their previous metrics exactly, confirming the median extraction
preserves parity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…inference
- ``ForecastOutput`` is now a single dataclass per ``predict()`` call,
  not a per-series sequence. Its ``quantiles`` field is a
  ``Sequence[np.ndarray]`` aligned with the input series, each entry
  shape ``(n_cutoffs_i, Q, prediction_length, C)``. The
  ``quantile_levels`` tuple is shared across the batch. ``.point``
  returns one ndarray per series.
- Adapter signature is now
  ``predict(self, x: ForecastInput) -> ForecastOutput``, with that
  return type explicit on every forecasting predict() in the codebase.
- The local Chronos solver is now Chronos-2 (matching the upstream
  migration on origin/main). The forecaster batches every
  (series, cutoff) pair into one ``Chronos2Pipeline.predict`` call, with
  variable context lengths handled by the pipeline's left-padding, and
  returns the model's full 9-level quantile fan.
- Updated all forecasting solvers + the constant adapter in
  ``get_one_result`` + ``ForecastResidualAdapter`` to the new contract.
Parity verified on Monash[m1_yearly_dataset, debug=True]: Naive,
SeasonalNaive, TFC-API[chronos-2], TFC-API[timesfm-2p5] match their
prior metrics exactly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ict-api
# Conflicts:
#	objective.py
#	solvers/chronos.py
- ``>=2.0`` was too loose: ``Chronos2Pipeline.predict`` with a
  variable-length list of tensors and the ``pipeline.quantiles``
  attribute stabilized in 2.2.x (the version verified end-to-end here).
  Switch to ``>=2.2,<3`` so we test what we ship and a future major bump
  can't silently break the contract.
- Drop ``pip::torch``: ``chronos-forecasting`` already pins
  ``torch<3,>=2.2`` transitively, so listing it again is dead weight.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Batched forecasting API + TFC-API + SeasonalNaive + Chronos-2 batching
tomMoral
reviewed
May 29, 2026
No description provided.