Introduce environment rebuilds in the eval_runner#772
Conversation
There was a problem hiding this comment.
Review: Introduce environment rebuilds in the eval_runner
Nice feature — the ability to rebuild environments periodically for build-time variation resampling is well-motivated, and the metric aggregation approach (concatenating raw data, then recomputing) is the correct way to do this statistically. The Hydra override system for variations is also well-designed with good error messages.
Below are findings roughly ordered by severity.
🔴 Critical: Missing break after failure in rebuild loop (eval_runner.py)
When an exception occurs during a rebuild iteration, job_manager.complete_job(job, ..., status=Status.FAILED) is called, but the for rebuild_idx in range(job.num_rebuilds) loop does not break out. Subsequent rebuild iterations will still execute (potentially hitting more errors or even succeeding partially), leading to:
- The job is already marked FAILED, but more work runs needlessly.
- If a later rebuild succeeds, its metrics are appended to
metrics_per_run. The post-loop aggregation then callsmetrics_logger.append_job_metricsfor a job that was already marked FAILED — inconsistent state.
Suggestion: Add a break (or set a flag and continue) after job_manager.complete_job(...) inside the except block so remaining rebuilds are skipped for that job.
🟡 Medium: Dead commented-out code in metrics_manager.py
The compute() method has ~8 lines of commented-out legacy code after the return statement (unreachable). This appears to be leftover from development/prototyping:
return metrics_data_collection
# dataset_path = get_metric_recorder_dataset_path(self._env)
# metrics_data: dict[str, Any] = {}
# ...Please remove before merge to keep the codebase clean.
🟡 Medium: num_steps-driven jobs with num_rebuilds > 1 run the full num_steps per rebuild (no splitting)
_split_episodes_across_rebuilds returns [None] * num_rebuilds when num_episodes is None (i.e., the steps-driven path). This means each rebuild runs the full num_steps, effectively multiplying the total simulation budget by num_rebuilds. The PR description states "Metrics are averaged across the rebuilds" and "num_episodes now refers to the total number of episodes across all rebuilds" — but no equivalent statement is made for num_steps.
This may be intentional (build-time resampling makes most sense for episode-driven workloads), but it should be explicitly documented. Consider either:
- Documenting that
num_stepsis per-rebuild (unlikenum_episodeswhich is total-across-rebuilds), or - Splitting
num_stepsanalogously tonum_episodes.
🟡 Medium: episode_length_s=0.1 change in gr1_open_microwave_environment.py looks unintentional
The episode length was reduced from 5.0 to 0.1 seconds. This makes the environment essentially instant-reset and is likely a debugging artifact that slipped in. If this is intentional for CI speed, a comment explaining why would be helpful; otherwise please revert.
🟡 Medium: GlobalHydra state cleared unconditionally in variations_hydra.py
if GlobalHydra.instance().is_initialized():
GlobalHydra.instance().clear()This is called from compose_variations_cfg_and_apply_overrides, which is invoked both from ArenaEnvBuilder.compose_manager_cfg() and from the --list-variations printing path (via get_variations_catalogue_as_string). If any other part of the application has initialized Hydra for its own purposes (e.g., a training framework), this will destroy that state.
Suggestion: Consider using a more isolated approach, or at minimum document that this function has a global side-effect. The initialize context manager does clean up on exit, but the explicit clear() before it means any pre-existing Hydra config is lost even if this function raises.
🟢 Low: compile_env_notebook.py contains prototyping/debugging code
The notebook gained ~80 lines of inline aggregation logic (cells doing manual deepcopy, metric_name_to_aggregated_data, print statements like print(f"test: ...")) that duplicate what aggregate_metrics.py now provides. This is fine for development but might confuse future readers if left in the merged state.
🟢 Low: Assertion-based validation in production code paths
_split_episodes_across_rebuilds and aggregate_metrics use assert for input validation (e.g., assert num_episodes >= num_rebuilds). Assertions can be disabled with python -O. If this is a concern in production, consider raising ValueError instead. If assertions are the project convention, this is fine — just flagging for awareness.
🟢 Low: Job.from_dict doesn't validate num_rebuilds type
num_rebuilds = data.get("num_rebuilds", 1) trusts the JSON value. A non-integer value (e.g., "num_rebuilds": "2" or null) would pass silently and fail later. The __init__ assertion num_rebuilds > 0 would catch None with a TypeError but not a string "2". Minor, since JSON configs are typically validated upstream.
Overall: The core design is sound. The metric aggregation and Hydra variation systems are well-tested. The main actionable item is the missing break in the rebuild failure path, which could lead to confusing behavior in production. The rest are cleanup items.
Update (commit 42b081a): Reviewed incremental changes (12 new commits). The new commits add agentic environment generation, per-clone Docker containers, env-graph spec refactoring (type→kind, parent/child→subject/reference), and reload_arena_modules removal from the eval loop. These are well-tested and introduce no new issues in the rebuild feature.
Status of previous findings:
- 🔴 Missing
breakafter failure in rebuild loop — still present. The loop continues aftercomplete_job(..., status=FAILED). - 🟡 Dead commented-out code in
metrics_manager.py— still present (lines afterreturn). - 🟡
episode_length_s=0.1ingr1_open_microwave_environment.py— still present. - 🟡 Other medium/low findings — not addressed but may be out of scope for this PR.
No new issues introduced by the incremental changes.
Greptile SummaryThis PR introduces environment rebuilds in the
Confidence Score: 4/5The rebuild loop and aggregation logic are well-structured, but the HDF5 dataset filename is not rebuild-indexed, which may cause recorded data from earlier rebuilds to be re-read by later ones — inflating episode counts and metric values silently. The core concern is in load_env: dataset_filename is always set to dataset_{job_name} with no rebuild_idx component. If HDF5DatasetFileHandler opens the file in append mode (or if the file persists between rebuilds), rebuild N's compute_metrics() call reads all episodes written by rebuilds 0..N, so metrics_per_run[N] already contains prior-rebuild data before aggregate_metrics concatenates everything again. This would silently produce incorrect aggregated values. The risk is real and should be verified or guarded against before merging with num_rebuilds > 1. isaaclab_arena/evaluation/eval_runner.py — specifically the dataset_filename assignment in load_env() and how rebuild isolation interacts with the HDF5 recorder. Important Files Changed
Sequence DiagramsequenceDiagram
participant Main as eval_runner.main()
participant JM as JobManager
participant LE as load_env()
participant RP as rollout_policy()
participant MM as MetricsManager.compute()
participant AM as aggregate_metrics()
participant ML as MetricsLogger
Main->>JM: iterate jobs
JM-->>Main: job (with num_rebuilds)
Main->>Main: _split_episodes_across_rebuilds()
loop for rebuild_idx in range(job.num_rebuilds)
Main->>LE: load_env(job.arena_env_args, job.name)
LE-->>Main: "env (dataset_filename = dataset_{job_name})"
Main->>RP: rollout_policy(env, policy, num_episodes_this_rebuild)
RP->>MM: env.compute_metrics()
MM-->>RP: MetricsDataCollection (per-rebuild)
RP-->>Main: metrics (MetricsDataCollection)
Main->>Main: metrics_per_run.append(metrics)
end
Main->>AM: aggregate_metrics(metrics_per_run)
AM-->>Main: aggregated MetricsDataCollection
Main->>ML: append_job_metrics(job.name, aggregated)
Reviews (5): Last reviewed commit: "Fix tests. Make metrics lookup-able by n..." | Re-trigger Greptile |
e1e4ee5 to
42b081a
Compare
|
Review Update (77de567): Several previously flagged issues have been addressed in this push: ✅ Fixed: Remaining items (original inline comments still visible):
No new issues found in the incremental changes. |
Summary
Introduce environment rebuilds in the eval_runner
Detailed description
num_rebuildsto the eval spec.num_episodesnow refers to the total number of episodes across all rebuilds.