Introduce environment rebuilds in the eval_runner by alexmillane · Pull Request #772 · isaac-sim/IsaacLab-Arena

alexmillane · 2026-06-10T11:15:43Z

Summary

Introduce environment rebuilds in the eval_runner

Detailed description

To resample build-time variations, we need to rebuild the environment periodically.
This introduces num_rebuilds to the eval spec.
Each rebuild rebuilds the environment.
Metrics are averaged across the rebuilds.
num_episodes now refers to the total number of episodes across all rebuilds.

isaaclab-review-bot

Review: Introduce environment rebuilds in the eval_runner

Nice feature — the ability to rebuild environments periodically for build-time variation resampling is well-motivated, and the metric aggregation approach (concatenating raw data, then recomputing) is the correct way to do this statistically. The Hydra override system for variations is also well-designed with good error messages.

Below are findings roughly ordered by severity.

🔴 Critical: Missing `break` after failure in rebuild loop (`eval_runner.py`)

When an exception occurs during a rebuild iteration, job_manager.complete_job(job, ..., status=Status.FAILED) is called, but the for rebuild_idx in range(job.num_rebuilds) loop does not break out. Subsequent rebuild iterations will still execute (potentially hitting more errors or even succeeding partially), leading to:

The job is already marked FAILED, but more work runs needlessly.
If a later rebuild succeeds, its metrics are appended to metrics_per_run. The post-loop aggregation then calls metrics_logger.append_job_metrics for a job that was already marked FAILED — inconsistent state.

Suggestion: Add a break (or set a flag and continue) after job_manager.complete_job(...) inside the except block so remaining rebuilds are skipped for that job.

🟡 Medium: Dead commented-out code in `metrics_manager.py`

The compute() method has ~8 lines of commented-out legacy code after the return statement (unreachable). This appears to be leftover from development/prototyping:

return metrics_data_collection
# dataset_path = get_metric_recorder_dataset_path(self._env)
# metrics_data: dict[str, Any] = {}
# ...

Please remove before merge to keep the codebase clean.

🟡 Medium: `num_steps`-driven jobs with `num_rebuilds > 1` run the full `num_steps` per rebuild (no splitting)

_split_episodes_across_rebuilds returns [None] * num_rebuilds when num_episodes is None (i.e., the steps-driven path). This means each rebuild runs the full num_steps, effectively multiplying the total simulation budget by num_rebuilds. The PR description states "Metrics are averaged across the rebuilds" and "num_episodes now refers to the total number of episodes across all rebuilds" — but no equivalent statement is made for num_steps.

This may be intentional (build-time resampling makes most sense for episode-driven workloads), but it should be explicitly documented. Consider either:

Documenting that num_steps is per-rebuild (unlike num_episodes which is total-across-rebuilds), or
Splitting num_steps analogously to num_episodes.

🟡 Medium: `episode_length_s=0.1` change in `gr1_open_microwave_environment.py` looks unintentional

The episode length was reduced from 5.0 to 0.1 seconds. This makes the environment essentially instant-reset and is likely a debugging artifact that slipped in. If this is intentional for CI speed, a comment explaining why would be helpful; otherwise please revert.

🟡 Medium: GlobalHydra state cleared unconditionally in `variations_hydra.py`

if GlobalHydra.instance().is_initialized():
    GlobalHydra.instance().clear()

This is called from compose_variations_cfg_and_apply_overrides, which is invoked both from ArenaEnvBuilder.compose_manager_cfg() and from the --list-variations printing path (via get_variations_catalogue_as_string). If any other part of the application has initialized Hydra for its own purposes (e.g., a training framework), this will destroy that state.

Suggestion: Consider using a more isolated approach, or at minimum document that this function has a global side-effect. The initialize context manager does clean up on exit, but the explicit clear() before it means any pre-existing Hydra config is lost even if this function raises.

🟢 Low: `compile_env_notebook.py` contains prototyping/debugging code

The notebook gained ~80 lines of inline aggregation logic (cells doing manual deepcopy, metric_name_to_aggregated_data, print statements like print(f"test: ...")) that duplicate what aggregate_metrics.py now provides. This is fine for development but might confuse future readers if left in the merged state.

🟢 Low: Assertion-based validation in production code paths

_split_episodes_across_rebuilds and aggregate_metrics use assert for input validation (e.g., assert num_episodes >= num_rebuilds). Assertions can be disabled with python -O. If this is a concern in production, consider raising ValueError instead. If assertions are the project convention, this is fine — just flagging for awareness.

🟢 Low: `Job.from_dict` doesn't validate `num_rebuilds` type

num_rebuilds = data.get("num_rebuilds", 1) trusts the JSON value. A non-integer value (e.g., "num_rebuilds": "2" or null) would pass silently and fail later. The __init__ assertion num_rebuilds > 0 would catch None with a TypeError but not a string "2". Minor, since JSON configs are typically validated upstream.

Overall: The core design is sound. The metric aggregation and Hydra variation systems are well-tested. The main actionable item is the missing break in the rebuild failure path, which could lead to confusing behavior in production. The rest are cleanup items.

Update (commit 42b081a): Reviewed incremental changes (12 new commits). The new commits add agentic environment generation, per-clone Docker containers, env-graph spec refactoring (type→kind, parent/child→subject/reference), and reload_arena_modules removal from the eval loop. These are well-tested and introduce no new issues in the rebuild feature.

Status of previous findings:

🔴 Missing break after failure in rebuild loop — still present. The loop continues after complete_job(..., status=FAILED).
🟡 Dead commented-out code in metrics_manager.py — still present (lines after return).
🟡 episode_length_s=0.1 in gr1_open_microwave_environment.py — still present.
🟡 Other medium/low findings — not addressed but may be out of scope for this PR.

No new issues introduced by the incremental changes.

greptile-apps · 2026-06-10T11:20:48Z

Greptile Summary

This PR introduces environment rebuilds in the eval_runner, allowing build-time variations to be resampled across multiple rollouts and their metrics aggregated into a single result. It also refactors the metrics layer from a plain dict to a structured MetricsDataCollection / MetricData dataclass hierarchy.

Adds num_rebuilds to Job (defaulting to 1 for backward compatibility), splits num_episodes evenly across rebuilds via _split_episodes_across_rebuilds, and collects per-rebuild MetricsDataCollection objects that are concatenated and recomputed in the new aggregate_metrics helper.
Introduces MetricData and MetricsDataCollection dataclasses so recorded per-episode arrays survive alongside the computed metric values, enabling the cross-rebuild recomputation in aggregate_metrics.
Updates MetricsLogger, MetricsManager, and environment tests to consume the new typed structure instead of plain dicts.

Confidence Score: 4/5

The rebuild loop and aggregation logic are well-structured, but the HDF5 dataset filename is not rebuild-indexed, which may cause recorded data from earlier rebuilds to be re-read by later ones — inflating episode counts and metric values silently.

The core concern is in load_env: dataset_filename is always set to dataset_{job_name} with no rebuild_idx component. If HDF5DatasetFileHandler opens the file in append mode (or if the file persists between rebuilds), rebuild N's compute_metrics() call reads all episodes written by rebuilds 0..N, so metrics_per_run[N] already contains prior-rebuild data before aggregate_metrics concatenates everything again. This would silently produce incorrect aggregated values. The risk is real and should be verified or guarded against before merging with num_rebuilds > 1.

isaaclab_arena/evaluation/eval_runner.py — specifically the dataset_filename assignment in load_env() and how rebuild isolation interacts with the HDF5 recorder.

Important Files Changed

Filename	Overview
isaaclab_arena/evaluation/eval_runner.py	Core change: adds rebuild loop and episode-splitting logic. Two issues: the HDF5 dataset filename is not rebuild-indexed (potential double-counting depending on file-handler open mode), and the video folder is shared across rebuilds.
isaaclab_arena/metrics/aggregate_metrics.py	New file: correctly concatenates recorded per-episode arrays and recomputes metric values from combined data; symmetric metric-name validation has been addressed (bidirectional set comparison).
isaaclab_arena/metrics/metric_data.py	New file: clean dataclass definitions for MetricData and MetricsDataCollection with no issues.
isaaclab_arena/evaluation/job_manager.py	Adds num_rebuilds field to Job with default=1 for backward compatibility; validation and table display updated correctly.
isaaclab_arena/metrics/metrics_manager.py	Refactored compute() to return a MetricsDataCollection instead of a plain dict; logic is equivalent and correct.
isaaclab_arena/metrics/metrics_logger.py	Updated to accept MetricsDataCollection instead of a raw dict; type handling for numpy scalars is preserved.
isaaclab_arena/evaluation/policy_runner.py	Minor type-annotation cleanup; rollout_policy return type updated to MetricsDataCollection
isaaclab_arena/tests/test_aggregate_metrics.py	New test suite for aggregate_metrics covering accumulation, single-run identity, name mismatch, and empty input; comprehensive coverage.
isaaclab_arena/tests/test_revolute_joint_moved_rate_metric.py	Updated to use MetricsDataCollection API instead of plain dict access; logic unchanged.
isaaclab_arena/tests/test_success_rate_metric.py	Updated to use MetricsDataCollection API; assertions correctly adapted to metric_data_entries dict access.
isaaclab_arena/environments/isaaclab_arena_manager_based_env.py	Return type of compute_metrics() updated to MetricsDataCollection; docstring updated accordingly.

Sequence Diagram

sequenceDiagram
    participant Main as eval_runner.main()
    participant JM as JobManager
    participant LE as load_env()
    participant RP as rollout_policy()
    participant MM as MetricsManager.compute()
    participant AM as aggregate_metrics()
    participant ML as MetricsLogger

    Main->>JM: iterate jobs
    JM-->>Main: job (with num_rebuilds)
    Main->>Main: _split_episodes_across_rebuilds()

    loop for rebuild_idx in range(job.num_rebuilds)
        Main->>LE: load_env(job.arena_env_args, job.name)
        LE-->>Main: "env (dataset_filename = dataset_{job_name})"
        Main->>RP: rollout_policy(env, policy, num_episodes_this_rebuild)
        RP->>MM: env.compute_metrics()
        MM-->>RP: MetricsDataCollection (per-rebuild)
        RP-->>Main: metrics (MetricsDataCollection)
        Main->>Main: metrics_per_run.append(metrics)
    end

    Main->>AM: aggregate_metrics(metrics_per_run)
    AM-->>Main: aggregated MetricsDataCollection
    Main->>ML: append_job_metrics(job.name, aggregated)

_{Reviews (5): Last reviewed commit: "Fix tests. Make metrics lookup-able by n..." | Re-trigger Greptile}

isaaclab-review-bot · 2026-06-10T11:40:19Z

Review Update (77de567): Several previously flagged issues have been addressed in this push:

✅ Fixed: episode_length_s reverted from 0.1 back to 5.0 in gr1_open_microwave_environment.py
✅ Fixed: Stale commented-out code removed from metrics_manager.py (both the lines at the top of compute() and the unreachable block after return)
✅ Fixed: Debug prints and dead code removed from compile_env_notebook.py (notebook significantly simplified)

Remaining items (original inline comments still visible):

complete_job called per-rebuild rather than once per job (eval_runner.py)
One-way metric-name validation in aggregate_metrics.py

No new issues found in the incremental changes.

alexmillane requested review from cvolkcvolk, peterd-NV, qianl-nv, viiik-inside, xyao-nv and zhx06 as code owners June 10, 2026 11:15

isaaclab-review-bot Bot reviewed Jun 10, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 10, 2026

View reviewed changes

alexmillane added 6 commits June 10, 2026 13:23

WIP

9d4fed8

Got metric aggregation working in a notebook

d7fe3d9

Solve issue with aliasing.

cc775c3

Prepare for claude to clean up.

6d01d97

Add num_rebuilds to the job spec

e94aa12

Split num_episodes over the rebuilds.

42b081a

alexmillane force-pushed the alex/feature/build_time_resampling branch from e1e4ee5 to 42b081a Compare June 10, 2026 11:24

greptile-apps Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread isaaclab_arena/metrics/aggregate_metrics.py Outdated

alexmillane added 2 commits June 10, 2026 13:29

Self review

7138a63

Revert some files

77de567

xyao-nv reviewed Jun 11, 2026

View reviewed changes

Comment thread isaaclab_arena/metrics/metrics_logger.py Outdated

xyao-nv approved these changes Jun 11, 2026

View reviewed changes

Respond to review.

380e8d4

greptile-apps Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread isaaclab_arena/evaluation/eval_runner.py

Fix tests. Make metrics lookup-able by name.

f8af81f

alexmillane merged commit 97591b4 into main Jun 11, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce environment rebuilds in the eval_runner#772

Introduce environment rebuilds in the eval_runner#772
alexmillane merged 10 commits into
mainfrom
alex/feature/build_time_resampling

alexmillane commented Jun 10, 2026

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

greptile-apps Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

isaaclab-review-bot Bot commented Jun 10, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alexmillane commented Jun 10, 2026

Summary

Detailed description

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Review: Introduce environment rebuilds in the eval_runner

🔴 Critical: Missing break after failure in rebuild loop (eval_runner.py)

🟡 Medium: Dead commented-out code in metrics_manager.py

🟡 Medium: num_steps-driven jobs with num_rebuilds > 1 run the full num_steps per rebuild (no splitting)

🟡 Medium: episode_length_s=0.1 change in gr1_open_microwave_environment.py looks unintentional

🟡 Medium: GlobalHydra state cleared unconditionally in variations_hydra.py

🟢 Low: compile_env_notebook.py contains prototyping/debugging code

🟢 Low: Assertion-based validation in production code paths

🟢 Low: Job.from_dict doesn't validate num_rebuilds type

Uh oh!

greptile-apps Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

isaaclab-review-bot Bot commented Jun 10, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

isaaclab-review-bot Bot left a comment •

edited

Loading

🔴 Critical: Missing `break` after failure in rebuild loop (`eval_runner.py`)

🟡 Medium: Dead commented-out code in `metrics_manager.py`

🟡 Medium: `num_steps`-driven jobs with `num_rebuilds > 1` run the full `num_steps` per rebuild (no splitting)

🟡 Medium: `episode_length_s=0.1` change in `gr1_open_microwave_environment.py` looks unintentional

🟡 Medium: GlobalHydra state cleared unconditionally in `variations_hydra.py`

🟢 Low: `compile_env_notebook.py` contains prototyping/debugging code

🟢 Low: `Job.from_dict` doesn't validate `num_rebuilds` type

greptile-apps Bot commented Jun 10, 2026 •

edited

Loading