Clarify NaN handling for temperature-simulation metric counts #3
Open
huawei-lin wants to merge 1 commit into
Summary
This MR updates all `temperature-simulation` task instructions to explicitly define how final self-evaluation metrics should handle missing observed temperatures after the exact `datetime` + rounded-depth merge.

The current instructions require agents to:
- perform the exact `datetime` + rounded-depth merge
- write `overall_n_pairs`, `annual_deep_n_pairs`, and `summer_deep_n_pairs` to `/root/metrics.json`

However, the instructions do not currently specify whether rows with missing or `NaN` observed temperature values should be kept or filtered before counting `*_n_pairs`. This creates an ambiguity because the hidden verifier counts raw merged rows for `*_n_pairs`, while pandas/numpy RMSE arithmetic naturally ignores `NaN` errors when computing the mean.

This MR resolves the ambiguity by stating that agents must:
- keep rows with missing observed temperatures when counting `overall_n_pairs`, `annual_deep_n_pairs`, and `summer_deep_n_pairs`,
- compute RMSE so that `NaN` errors do not contribute to the mean but their rows still count in `*_n_pairs`.

Motivation
During evaluation of `temperature-simulation-1`, an agent produced a physically valid calibration and passed all RMSE thresholds, but failed the verifier because its reported `overall_n_pairs` did not match the hidden verifier's count.

The agent did the statistically common thing: it filtered rows with missing observed temperatures before computing metrics. In that run:
- the merged table contained rows with `temp = NaN`,
- the agent reported `overall_n_pairs = 2814`,
- the verifier expected `overall_n_pairs = 2818`,
- `NaN` errors do not contribute to the pandas/numpy mean.

The mismatch was therefore not caused by an invalid GLM calibration or an incorrect merge key. It was caused by an underspecified metric-counting convention in the visible task instructions.

Without this clarification, agents can reasonably implement either of these two behaviors:
1. Filter `NaN` observations first, which is common in scientific/statistical RMSE computation.
2. Keep `NaN` observation rows in the merged table for pair counts, which is what the verifier currently does.

Only the second behavior passes the benchmark. Since the verifier is hidden from agents, the task should make this convention explicit.
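The two behaviors can be sketched on toy data. This is a minimal illustration, not the verifier's code: the frames, values, and the `temp_sim`/`temp_obs` column names are assumptions made up for the example. It shows that the pair counts differ while the pandas RMSE is the same either way.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values and column names are illustrative only.
stamps = pd.to_datetime(["2020-07-01", "2020-07-01", "2020-07-02", "2020-07-02"])
sim = pd.DataFrame({"datetime": stamps, "depth": [1.0, 2.0, 1.0, 2.0],
                    "temp": [20.0, 18.0, 21.0, 19.0]})
obs = pd.DataFrame({"datetime": stamps, "depth": [1.0, 2.0, 1.0, 2.0],
                    "temp": [21.0, np.nan, 20.0, 19.0]})  # one missing observation

# Exact merge on datetime + (already rounded) depth.
merged = sim.merge(obs, on=["datetime", "depth"], suffixes=("_sim", "_obs"))

# Behavior 1: drop NaN observations first, then count and score.
filtered = merged.dropna(subset=["temp_obs"])
n_pairs_filtered = len(filtered)
rmse_filtered = float(np.sqrt(((filtered["temp_sim"] - filtered["temp_obs"]) ** 2).mean()))

# Behavior 2: count raw merged rows; pandas' mean skips the NaN error by default.
n_pairs_raw = len(merged)
rmse_raw = float(np.sqrt(((merged["temp_sim"] - merged["temp_obs"]) ** 2).mean()))

print(n_pairs_filtered, n_pairs_raw)  # the counts differ by the number of NaN rows
print(rmse_filtered, rmse_raw)        # the RMSE values agree
```

In this sketch only the count depends on the chosen convention, which matches the failure mode described above: the RMSE thresholds pass while `*_n_pairs` disagrees.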
What changed
Updated:
- `tasks/temperature-simulation/temperature-simulation-1/instruction.md`
- `tasks/temperature-simulation/temperature-simulation-2/instruction.md`
- `tasks/temperature-simulation/temperature-simulation-3/instruction.md`
- `tasks/temperature-simulation/temperature-simulation-4/instruction.md`
- `tasks/temperature-simulation/temperature-simulation-5/instruction.md`

Added this clarification after the RMSE threshold definition:
Why this wording
The wording intentionally separates two concepts that are easy to conflate:
- `*_n_pairs` is a raw exact-merge row count.
- In the RMSE, `NaN` errors are ignored by standard pandas/numpy mean behavior.

This mirrors the current verifier behavior while still telling agents exactly what to do without exposing the verifier implementation.
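One caveat when implementing this separation: pandas' `Series.mean()` skips `NaN` by default, but plain `numpy.mean` propagates it, so NumPy-only code would need `numpy.nanmean` to get the same result. The helper below is a hypothetical sketch (the name `rmse_and_n_pairs` and the sample values are assumptions, not part of the task or the verifier) that keeps the count and the mean as two independent computations.

```python
import numpy as np
import pandas as pd

def rmse_and_n_pairs(sim_temp, obs_temp):
    """Hypothetical helper: return (rmse, n_pairs) for already-merged columns."""
    sim_arr = np.asarray(sim_temp, dtype=float)
    obs_arr = np.asarray(obs_temp, dtype=float)
    n_pairs = int(sim_arr.shape[0])            # raw row count, NaN rows included
    sq_err = (sim_arr - obs_arr) ** 2
    rmse = float(np.sqrt(np.nanmean(sq_err)))  # NaN errors are skipped in the mean
    return rmse, n_pairs

# Illustrative values only.
sim_temp = [20.0, 18.0, 21.0, 19.0]
obs_temp = [21.0, float("nan"), 20.0, 19.0]
rmse, n_pairs = rmse_and_n_pairs(sim_temp, obs_temp)

sq_err = (np.array(sim_temp) - np.array(obs_temp)) ** 2
plain_numpy_mean = np.mean(sq_err)      # NaN: np.mean does not skip missing values
pandas_mean = pd.Series(sq_err).mean()  # skipna=True by default, so NaN is ignored
```

Keeping `n_pairs` as a plain length and the mean as a separate `NaN`-skipping step means neither quantity depends on whether rows were filtered earlier.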
Testing
No benchmark execution was run for this documentation-only change.
Manual validation performed:
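- compared the `temperature-simulation-1` run against verifier output,
- confirmed the `overall_n_pairs` mismatch (`2814` reported vs `2818` expected),
- confirmed that the merged table contains rows with `temp = NaN`.

Risk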
Low. This does not change task data, verifier code, thresholds, or expected outputs. It only makes the existing verifier convention visible to agents.
The main behavior change is that future agents are less likely to fail solely because they applied a reasonable but verifier-incompatible NaN filtering convention.