Clarify NaN handling for temperature-simulation metric counts by huawei-lin · Pull Request #3 · cxcscmu/SkillLearnBench

huawei-lin · 2026-06-01T18:46:47Z

Summary

This PR updates all temperature-simulation task instructions to explicitly define how the final self-evaluation metrics should handle missing observed temperatures after the exact datetime + rounded-depth merge.

The current instructions require agents to:

compute metrics from an exact datetime + rounded-depth merge,
avoid nearest-time matching, interpolation, or alternative depth binning,
write overall_n_pairs, annual_deep_n_pairs, and summer_deep_n_pairs to /root/metrics.json,
match the hidden verifier's recomputed metrics closely.

However, the instructions do not currently specify whether rows with missing or NaN observed temperature values should be kept or filtered before counting *_n_pairs. This creates an ambiguity because the hidden verifier counts raw merged rows for *_n_pairs, while RMSE arithmetic built on a NaN-skipping mean (the pandas default, or numpy.nanmean; plain numpy.mean on an array propagates NaN instead) ignores NaN errors. The count and the RMSE are therefore effectively computed over different sets of rows.

This PR resolves the ambiguity by stating that agents must:

perform the exact merge without dropping rows from observations or from the merged table,
count raw merged rows in each subset for overall_n_pairs, annual_deep_n_pairs, and summer_deep_n_pairs,
compute RMSE from the raw merged error series using standard pandas/numpy mean semantics, where NaN errors do not contribute to the mean but their rows still count in *_n_pairs.
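The three requirements above can be sketched in pandas as follows. This is only an illustration under stated assumptions, not the verifier's code: the column names (`datetime`, `depth`, `temp`), the rounding precision (`decimals=0`), the inner join, and the helper name `merge_and_score` are all hypothetical. Only the overall metric is shown, because the definitions of the annual-deep and summer-deep subsets are not part of this description.

```python
import numpy as np
import pandas as pd


def merge_and_score(sim: pd.DataFrame, obs: pd.DataFrame, decimals: int = 0) -> dict:
    """Exact datetime + rounded-depth merge that keeps NaN observation rows.

    Assumes both frames have `datetime`, `depth`, and `temp` columns
    (hypothetical names) and that an inner join is the intended merge.
    """
    sim = sim.assign(depth_key=sim["depth"].round(decimals))
    obs = obs.assign(depth_key=obs["depth"].round(decimals))

    # No dropna() before or after the merge: NaN observations stay in the table.
    merged = obs.merge(
        sim, on=["datetime", "depth_key"], how="inner", suffixes=("_obs", "_sim")
    )

    err = merged["temp_sim"] - merged["temp_obs"]
    # Series.mean() skips NaN by default, so NaN errors do not enter the RMSE.
    rmse = float(np.sqrt((err ** 2).mean()))

    # The pair count is the raw number of merged rows, NaN rows included.
    return {"overall_n_pairs": int(len(merged)), "overall_rmse": rmse}


t1, t2 = pd.to_datetime(["2020-06-01 12:00", "2020-06-02 12:00"])
sim = pd.DataFrame(
    {"datetime": [t1, t1, t2], "depth": [1.0, 2.0, 1.0], "temp": [10.0, 9.0, 11.0]}
)
obs = pd.DataFrame(
    {"datetime": [t1, t1, t2], "depth": [1.2, 1.9, 0.8], "temp": [11.0, np.nan, 11.0]}
)
result = merge_and_score(sim, obs)
```

In this toy input all three observation rows find a match, so `overall_n_pairs` is 3 even though only two rows contribute to the RMSE.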

Motivation

During evaluation of temperature-simulation-1, an agent produced a physically valid calibration and passed all RMSE thresholds, but failed the verifier because its reported overall_n_pairs did not match the hidden verifier.

The agent did the statistically common thing: it filtered rows with missing observed temperatures before computing metrics. In that run:

the observation table contained 4 rows with temp = NaN,
the agent filtered those rows before counting pairs,
the agent reported overall_n_pairs = 2814,
the verifier counted raw merged rows and expected overall_n_pairs = 2818,
the RMSE values were effectively unchanged because NaN errors do not contribute to the pandas/numpy mean.

The mismatch was therefore not caused by an invalid GLM calibration or an incorrect merge key. It was caused by an underspecified metric-counting convention in the visible task instructions.

Without this clarification, agents can reasonably implement either of these two behaviors:

1. Filter NaN observations first, which is common in scientific/statistical RMSE computation.
2. Keep NaN observation rows in the merged table for pair counts, which is what the verifier currently does.

Only the second behavior passes the benchmark. Since the verifier is hidden from agents, the task should make this convention explicit.
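A small toy table shows why the two behaviors disagree on the count but not on the RMSE. The data and the column names `temp_obs` and `temp_sim` are made up for illustration; the `score` helper is hypothetical and assumes the RMSE is taken with the pandas NaN-skipping mean.

```python
import numpy as np
import pandas as pd

# Toy merged table with one missing observation (values are illustrative only).
merged = pd.DataFrame(
    {
        "temp_obs": [10.0, np.nan, 12.0, 14.0],
        "temp_sim": [11.0, 9.0, 12.0, 12.0],
    }
)


def score(df: pd.DataFrame) -> tuple[int, float]:
    """Return (row count, RMSE) for a merged table."""
    err = df["temp_sim"] - df["temp_obs"]
    # Series.mean() skips NaN by default (skipna=True).
    return len(df), float(np.sqrt((err ** 2).mean()))


# Behavior 1: filter NaN observations before counting.
n_filtered, rmse_filtered = score(merged.dropna(subset=["temp_obs"]))

# Behavior 2: count raw merged rows, let the mean skip NaN errors.
n_raw, rmse_raw = score(merged)
```

Here `n_raw` exceeds `n_filtered` by exactly the number of NaN observation rows, while the two RMSE values coincide, which is the same pattern as the 2814 vs 2818 mismatch described above.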

What changed

Updated:

tasks/temperature-simulation/temperature-simulation-1/instruction.md
tasks/temperature-simulation/temperature-simulation-2/instruction.md
tasks/temperature-simulation/temperature-simulation-3/instruction.md
tasks/temperature-simulation/temperature-simulation-4/instruction.md
tasks/temperature-simulation/temperature-simulation-5/instruction.md

Added this clarification after the RMSE threshold definition:

When computing the final reported metrics, first perform the exact `datetime` + rounded-depth merge without dropping rows from the observations or from the merged table, even if an observed temperature value is missing or `NaN`. The `overall_n_pairs`, `annual_deep_n_pairs`, and `summer_deep_n_pairs` values must count the raw merged rows in their corresponding subsets before any `NaN` temperature values are excluded from arithmetic. For the RMSE values, compute the error series from the raw merged table and use standard pandas/numpy mean semantics, so `NaN` errors do not contribute to the mean but their rows still count in the `*_n_pairs` fields.

Why this wording

The wording intentionally separates two concepts that are easy to conflate:

*_n_pairs is a raw exact-merge row count.
RMSE is computed from the error series, where NaN errors are ignored by standard pandas/numpy mean behavior.

This mirrors the current verifier behavior while still telling agents exactly what to do without exposing the verifier implementation.
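One caveat about the phrase "standard pandas/numpy mean semantics" is that the two libraries differ by default. The snippet below is a minimal check on made-up values, not something taken from the task or the verifier: the pandas `Series.mean()` skips NaN, `numpy.mean` on a plain array propagates it, and `numpy.nanmean` is the NaN-skipping NumPy equivalent.

```python
import numpy as np
import pandas as pd

# Squared errors with one NaN (illustrative values only).
sq_err = pd.Series([1.0, np.nan, -2.0]) ** 2

pandas_mean = sq_err.mean()                      # skipna=True by default
numpy_mean = np.mean(sq_err.to_numpy())          # NaN propagates on a plain array
numpy_nanmean = np.nanmean(sq_err.to_numpy())    # explicit NaN-skipping mean
```

The call to `to_numpy()` matters here: applying `np.mean` directly to a pandas Series dispatches to the Series method and skips NaN. An agent that converts to arrays before averaging would need `numpy.nanmean` to reproduce the NaN-skipping behavior the instructions describe.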

Testing

No benchmark execution was run for this documentation-only change.

Manual validation performed:

inspected the current task instructions,
compared a failing temperature-simulation-1 run against verifier output,
confirmed the failure mode was only overall_n_pairs mismatch (2814 reported vs 2818 expected),
confirmed the 4 missing rows were observation rows with temp = NaN,
confirmed RMSE thresholds were otherwise satisfied.

Risk

Low. This does not change task data, verifier code, thresholds, or expected outputs. It only makes the existing verifier convention visible to agents.

The main behavior change is that future agents are less likely to fail solely because they applied a reasonable but verifier-incompatible NaN filtering convention.

Clarify temperature simulation metric counts

24fdacf

huawei-lin wants to merge 1 commit into cxcscmu:main from huawei-lin:clarify-temperature-simulation-nan-metrics
