Skip to content

Introduce environment rebuilds in the eval_runner#772

Merged
alexmillane merged 10 commits into
mainfrom
alex/feature/build_time_resampling
Jun 11, 2026
Merged

Introduce environment rebuilds in the eval_runner#772
alexmillane merged 10 commits into
mainfrom
alex/feature/build_time_resampling

Conversation

@alexmillane

Copy link
Copy Markdown
Collaborator

Summary

Introduce environment rebuilds in the eval_runner

Detailed description

  • To resample build-time variations, we need to rebuild the environment periodically.
  • This introduces num_rebuilds to the eval spec.
  • Each rebuild rebuilds the environment.
  • Metrics are averaged across the rebuilds.
  • num_episodes now refers to the total number of episodes across all rebuilds.

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Introduce environment rebuilds in the eval_runner

Nice feature — the ability to rebuild environments periodically for build-time variation resampling is well-motivated, and the metric aggregation approach (concatenating raw data, then recomputing) is the correct way to do this statistically. The Hydra override system for variations is also well-designed with good error messages.

Below are findings roughly ordered by severity.


🔴 Critical: Missing break after failure in rebuild loop (eval_runner.py)

When an exception occurs during a rebuild iteration, job_manager.complete_job(job, ..., status=Status.FAILED) is called, but the for rebuild_idx in range(job.num_rebuilds) loop does not break out. Subsequent rebuild iterations will still execute (potentially hitting more errors or even succeeding partially), leading to:

  1. The job is already marked FAILED, but more work runs needlessly.
  2. If a later rebuild succeeds, its metrics are appended to metrics_per_run. The post-loop aggregation then calls metrics_logger.append_job_metrics for a job that was already marked FAILED — inconsistent state.

Suggestion: Add a break (or set a flag and continue) after job_manager.complete_job(...) inside the except block so remaining rebuilds are skipped for that job.


🟡 Medium: Dead commented-out code in metrics_manager.py

The compute() method has ~8 lines of commented-out legacy code after the return statement (unreachable). This appears to be leftover from development/prototyping:

return metrics_data_collection
# dataset_path = get_metric_recorder_dataset_path(self._env)
# metrics_data: dict[str, Any] = {}
# ...

Please remove before merge to keep the codebase clean.


🟡 Medium: num_steps-driven jobs with num_rebuilds > 1 run the full num_steps per rebuild (no splitting)

_split_episodes_across_rebuilds returns [None] * num_rebuilds when num_episodes is None (i.e., the steps-driven path). This means each rebuild runs the full num_steps, effectively multiplying the total simulation budget by num_rebuilds. The PR description states "Metrics are averaged across the rebuilds" and "num_episodes now refers to the total number of episodes across all rebuilds" — but no equivalent statement is made for num_steps.

This may be intentional (build-time resampling makes most sense for episode-driven workloads), but it should be explicitly documented. Consider either:

  • Documenting that num_steps is per-rebuild (unlike num_episodes which is total-across-rebuilds), or
  • Splitting num_steps analogously to num_episodes.

🟡 Medium: episode_length_s=0.1 change in gr1_open_microwave_environment.py looks unintentional

The episode length was reduced from 5.0 to 0.1 seconds. This makes the environment essentially instant-reset and is likely a debugging artifact that slipped in. If this is intentional for CI speed, a comment explaining why would be helpful; otherwise please revert.


🟡 Medium: GlobalHydra state cleared unconditionally in variations_hydra.py

if GlobalHydra.instance().is_initialized():
    GlobalHydra.instance().clear()

This is called from compose_variations_cfg_and_apply_overrides, which is invoked both from ArenaEnvBuilder.compose_manager_cfg() and from the --list-variations printing path (via get_variations_catalogue_as_string). If any other part of the application has initialized Hydra for its own purposes (e.g., a training framework), this will destroy that state.

Suggestion: Consider using a more isolated approach, or at minimum document that this function has a global side-effect. The initialize context manager does clean up on exit, but the explicit clear() before it means any pre-existing Hydra config is lost even if this function raises.


🟢 Low: compile_env_notebook.py contains prototyping/debugging code

The notebook gained ~80 lines of inline aggregation logic (cells doing manual deepcopy, metric_name_to_aggregated_data, print statements like print(f"test: ...")) that duplicate what aggregate_metrics.py now provides. This is fine for development but might confuse future readers if left in the merged state.


🟢 Low: Assertion-based validation in production code paths

_split_episodes_across_rebuilds and aggregate_metrics use assert for input validation (e.g., assert num_episodes >= num_rebuilds). Assertions can be disabled with python -O. If this is a concern in production, consider raising ValueError instead. If assertions are the project convention, this is fine — just flagging for awareness.


🟢 Low: Job.from_dict doesn't validate num_rebuilds type

num_rebuilds = data.get("num_rebuilds", 1) trusts the JSON value. A non-integer value (e.g., "num_rebuilds": "2" or null) would pass silently and fail later. The __init__ assertion num_rebuilds > 0 would catch None with a TypeError but not a string "2". Minor, since JSON configs are typically validated upstream.


Overall: The core design is sound. The metric aggregation and Hydra variation systems are well-tested. The main actionable item is the missing break in the rebuild failure path, which could lead to confusing behavior in production. The rest are cleanup items.


Update (commit 42b081a): Reviewed incremental changes (12 new commits). The new commits add agentic environment generation, per-clone Docker containers, env-graph spec refactoring (typekind, parent/childsubject/reference), and reload_arena_modules removal from the eval loop. These are well-tested and introduce no new issues in the rebuild feature.

Status of previous findings:

  • 🔴 Missing break after failure in rebuild loop — still present. The loop continues after complete_job(..., status=FAILED).
  • 🟡 Dead commented-out code in metrics_manager.pystill present (lines after return).
  • 🟡 episode_length_s=0.1 in gr1_open_microwave_environment.pystill present.
  • 🟡 Other medium/low findings — not addressed but may be out of scope for this PR.

No new issues introduced by the incremental changes.

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces environment rebuilds in the eval_runner, allowing build-time variations to be resampled across multiple rollouts and their metrics aggregated into a single result. It also refactors the metrics layer from a plain dict to a structured MetricsDataCollection / MetricData dataclass hierarchy.

  • Adds num_rebuilds to Job (defaulting to 1 for backward compatibility), splits num_episodes evenly across rebuilds via _split_episodes_across_rebuilds, and collects per-rebuild MetricsDataCollection objects that are concatenated and recomputed in the new aggregate_metrics helper.
  • Introduces MetricData and MetricsDataCollection dataclasses so recorded per-episode arrays survive alongside the computed metric values, enabling the cross-rebuild recomputation in aggregate_metrics.
  • Updates MetricsLogger, MetricsManager, and environment tests to consume the new typed structure instead of plain dicts.

Confidence Score: 4/5

The rebuild loop and aggregation logic are well-structured, but the HDF5 dataset filename is not rebuild-indexed, which may cause recorded data from earlier rebuilds to be re-read by later ones — inflating episode counts and metric values silently.

The core concern is in load_env: dataset_filename is always set to dataset_{job_name} with no rebuild_idx component. If HDF5DatasetFileHandler opens the file in append mode (or if the file persists between rebuilds), rebuild N's compute_metrics() call reads all episodes written by rebuilds 0..N, so metrics_per_run[N] already contains prior-rebuild data before aggregate_metrics concatenates everything again. This would silently produce incorrect aggregated values. The risk is real and should be verified or guarded against before merging with num_rebuilds > 1.

isaaclab_arena/evaluation/eval_runner.py — specifically the dataset_filename assignment in load_env() and how rebuild isolation interacts with the HDF5 recorder.

Important Files Changed

Filename Overview
isaaclab_arena/evaluation/eval_runner.py Core change: adds rebuild loop and episode-splitting logic. Two issues: the HDF5 dataset filename is not rebuild-indexed (potential double-counting depending on file-handler open mode), and the video folder is shared across rebuilds.
isaaclab_arena/metrics/aggregate_metrics.py New file: correctly concatenates recorded per-episode arrays and recomputes metric values from combined data; symmetric metric-name validation has been addressed (bidirectional set comparison).
isaaclab_arena/metrics/metric_data.py New file: clean dataclass definitions for MetricData and MetricsDataCollection with no issues.
isaaclab_arena/evaluation/job_manager.py Adds num_rebuilds field to Job with default=1 for backward compatibility; validation and table display updated correctly.
isaaclab_arena/metrics/metrics_manager.py Refactored compute() to return a MetricsDataCollection instead of a plain dict; logic is equivalent and correct.
isaaclab_arena/metrics/metrics_logger.py Updated to accept MetricsDataCollection instead of a raw dict; type handling for numpy scalars is preserved.
isaaclab_arena/evaluation/policy_runner.py Minor type-annotation cleanup; rollout_policy return type updated to MetricsDataCollection
isaaclab_arena/tests/test_aggregate_metrics.py New test suite for aggregate_metrics covering accumulation, single-run identity, name mismatch, and empty input; comprehensive coverage.
isaaclab_arena/tests/test_revolute_joint_moved_rate_metric.py Updated to use MetricsDataCollection API instead of plain dict access; logic unchanged.
isaaclab_arena/tests/test_success_rate_metric.py Updated to use MetricsDataCollection API; assertions correctly adapted to metric_data_entries dict access.
isaaclab_arena/environments/isaaclab_arena_manager_based_env.py Return type of compute_metrics() updated to MetricsDataCollection; docstring updated accordingly.

Sequence Diagram

sequenceDiagram
    participant Main as eval_runner.main()
    participant JM as JobManager
    participant LE as load_env()
    participant RP as rollout_policy()
    participant MM as MetricsManager.compute()
    participant AM as aggregate_metrics()
    participant ML as MetricsLogger

    Main->>JM: iterate jobs
    JM-->>Main: job (with num_rebuilds)
    Main->>Main: _split_episodes_across_rebuilds()

    loop for rebuild_idx in range(job.num_rebuilds)
        Main->>LE: load_env(job.arena_env_args, job.name)
        LE-->>Main: "env (dataset_filename = dataset_{job_name})"
        Main->>RP: rollout_policy(env, policy, num_episodes_this_rebuild)
        RP->>MM: env.compute_metrics()
        MM-->>RP: MetricsDataCollection (per-rebuild)
        RP-->>Main: metrics (MetricsDataCollection)
        Main->>Main: metrics_per_run.append(metrics)
    end

    Main->>AM: aggregate_metrics(metrics_per_run)
    AM-->>Main: aggregated MetricsDataCollection
    Main->>ML: append_job_metrics(job.name, aggregated)
Loading

Reviews (5): Last reviewed commit: "Fix tests. Make metrics lookup-able by n..." | Re-trigger Greptile

Comment thread isaaclab_arena_environments/gr1_open_microwave_environment.py Outdated
Comment thread isaaclab_arena/evaluation/eval_runner.py
Comment thread isaaclab_arena/metrics/metrics_manager.py Outdated
Comment thread isaaclab_arena/metrics/metrics_manager.py Outdated
Comment thread isaaclab_arena/examples/compile_env_notebook.py Outdated
@alexmillane alexmillane force-pushed the alex/feature/build_time_resampling branch from e1e4ee5 to 42b081a Compare June 10, 2026 11:24
Comment thread isaaclab_arena/metrics/aggregate_metrics.py Outdated
@isaaclab-review-bot

Copy link
Copy Markdown
Contributor

Review Update (77de567): Several previously flagged issues have been addressed in this push:

Fixed: episode_length_s reverted from 0.1 back to 5.0 in gr1_open_microwave_environment.py
Fixed: Stale commented-out code removed from metrics_manager.py (both the lines at the top of compute() and the unreachable block after return)
Fixed: Debug prints and dead code removed from compile_env_notebook.py (notebook significantly simplified)

Remaining items (original inline comments still visible):

  • complete_job called per-rebuild rather than once per job (eval_runner.py)
  • One-way metric-name validation in aggregate_metrics.py

No new issues found in the incremental changes.

Comment thread isaaclab_arena/metrics/metrics_logger.py Outdated
Comment thread isaaclab_arena/evaluation/eval_runner.py
@alexmillane alexmillane merged commit 97591b4 into main Jun 11, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants