Sensitivity analysis MVP: MNPE/NPE posterior report #729

Open
cvolk wants to merge 60 commits into main from cvolk/feature/sensitivity_analysis_mvp1

Conversation

cvolk commented May 27, 2026

Collaborator

Summary

Sensitivity analysis toolbox (MNPE/NPE) running on synthetic data. From an eval sweep's per-episode results, learn which environment conditions drive success: fit a posterior over the varied factors, conditioned on the outcome, and render one summary figure.

Inspired by robolab, but the factors now come from a factors.yaml and are piped generically, allowing for arbitrary continuous/categorical factor mixes.

[Image: Screenshot from 2026-06-11 16-59-44]

Detailed description

  • factors.yaml declares the varied factors and their ranges. This file will be auto-generated by the Variation System and could later be written once into the same output file. For now it is hand-crafted.

  • SensitivityDataset generically handles n-dimensional factors, both categorical and continuous.

  • SensitivityAnalyzer auto-selects MNPE (mixed continuous + categorical) or NPE (continuous-only), trains on the full (theta, x), and samples the joint posterior at a chosen observation. Continuous factors are normalized so factors on very different scales train on equal footing.

  • generate_report produces one figure with a density curve per continuous factor and a probability bar chart per categorical factor. The output format follows the file extension (.png/.pdf).

  • A synthetic ground-truth simulator (synthetic.py)
    python -m isaaclab_arena.analysis.sensitivity.synthetic --kind {mixed,continuous,rich} runs the whole pipeline in one command.
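
As a rough illustration of the SensitivityAnalyzer bullet above, here is a minimal sketch of how the estimator choice and the range normalization could look. The `Factor` dataclass, `choose_estimator`, `normalize`, and the `object_mass` factor are hypothetical names made up for this example, and min-max scaling onto [0, 1] is an assumption about what "normalized" means here; none of this is the toolbox's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Factor:
    """Hypothetical stand-in for one factors.yaml entry."""
    name: str
    type: str  # "continuous" or "categorical"
    range: Optional[Tuple[float, float]] = None  # (low, high), continuous only


def choose_estimator(factors: Sequence[Factor]) -> str:
    """Return "NPE" for continuous-only schemas and "MNPE" for mixed ones."""
    kinds = {f.type for f in factors}
    if kinds == {"continuous"}:
        return "NPE"
    if kinds == {"continuous", "categorical"}:
        return "MNPE"
    raise ValueError(f"unsupported factor mix: {sorted(kinds)}")


def normalize(value: float, factor: Factor) -> float:
    """Map a continuous value from [low, high] onto [0, 1] (assumed scheme)."""
    low, high = factor.range
    return (value - low) / (high - low)


light = Factor("light_intensity", "continuous", (100.0, 5000.0))
mass = Factor("object_mass", "continuous", (0.05, 0.5))  # made-up factor
obj = Factor("object", "categorical")

print(choose_estimator([light, mass]))
print(choose_estimator([light, mass, obj]))
# Both midpoints land near 0.5 even though the raw scales differ by orders of magnitude.
print(normalize(2550.0, light), normalize(0.275, mass))
```

Scaling by the declared range is one simple way to put factors on equal footing; standardizing by the sample mean and standard deviation would be another.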

Next: plug in the real sim/variation pipeline.

isaaclab-review-bot[bot]

This comment was marked as outdated.

cvolk changed the title from "MVP-1: per-episode sensitivity recording + NPE analyzer" to "Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers" on May 28, 2026
cvolk added a commit that referenced this pull request May 28, 2026
Builds on the MVP-1 foundation (#729) with categorical factor support, a
cleaner analyzer/plotting separation, and a tighter eval-side / analysis-side
contract that drops a class of drift bugs.

- Analyzer hierarchy (BaseAnalyzer / PosteriorAnalyzer / NPEAnalyzer /
  MNPEAnalyzer / EmpiricalAnalyzer) dispatched via make_analyzer. Pure-
  categorical schemas use empirical frequency analysis directly (under
  uniform prior the posterior is exactly the normalized per-category
  success rate); sbi MNPE 0.26 also requires at least one continuous theta
  column, which this dispatch handles automatically.
- Split inference (analyzer.py) from rendering (plotting.py). Analyzers
  expose continuous_marginal_density and categorical_marginal_probs
  queries; plotting consumes them via plot_marginal. New plot types
  become additive (free functions) without touching the analyzer.
- Drop --factor_keys CLI flag on eval_runner. The writer now logs the
  full arena_env_args per episode; the analyzer-side factors.yaml picks
  what to study. Removes the drift bug class where --factor_keys and
  factors.yaml could disagree.
- Rename JSONL field "factors" -> "arena_env_args". Honest about
  provenance and leaves room for sibling source fields (future "sim_state"
  for MVP-3 reset-time snapshots, "variation_draws" for the variation
  system) without further wire-format changes.
- Add synthetic_data_categorical.py smoke-test generator and rename
  synthetic_data.py -> synthetic_data_continuous.py for symmetry.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
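
To illustrate the pure-categorical note in the commit message above: under a uniform prior, Bayes' rule reduces the posterior over categories to the per-category success rate, renormalized. The sketch below is a standalone illustration with made-up data; `categorical_posterior` is a hypothetical helper, not the EmpiricalAnalyzer code.

```python
from collections import defaultdict


def categorical_posterior(categories, successes):
    """Estimate P(category | success) under a uniform prior over categories.

    P(c | s) is proportional to P(s | c) * P(c). With a uniform P(c) this is
    the per-category success rate, renormalized so the values sum to 1.
    """
    counts = defaultdict(int)
    wins = defaultdict(int)
    for category, success in zip(categories, successes):
        counts[category] += 1
        wins[category] += int(success)
    rates = {c: wins[c] / counts[c] for c in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no successful episodes: posterior is undefined")
    return {c: rate / total for c, rate in rates.items()}


# Made-up episodes: 'cube' succeeds 3 of 4 times, 'can' succeeds 1 of 4 times.
objects = ["cube"] * 4 + ["can"] * 4
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]
posterior = categorical_posterior(objects, outcomes)
print(posterior)
```

Normalizing rates rather than raw success counts is what keeps the estimate tied to the uniform prior when the sweep visits some categories more often than others.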
cvolk force-pushed the cvolk/feature/sensitivity_analysis_mvp1 branch from 38baa56 to 74585f1 on May 28, 2026 15:36
Adds a policy-sensitivity analysis stack under isaaclab_arena/analysis/
sensitivity/: a SensitivityDataset loader (factors.yaml + episode JSONL),
NPE / MNPE / KDE / empirical analyzers (sbi-backed), continuous + categorical
factor support with LogUniform priors, and an interactive Plotly HTML report.

eval_runner gains an opt-in --episode_summary flag that appends one JSONL row
per recorded episode (full arena_env_args dict + task outcomes); the analyzer
decides which arena_env_args keys are factors via factors.yaml, so eval needs
no knowledge of "factors". Job now carries arena_env_args_dict so the writer
logs typed values. Adds sbi to dev deps.

Driver scripts: analyze_sensitivity.py (single factor/outcome) and
generate_sensitivity_report.py (full multi-factor HTML deliverable).

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
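
The eval-side / analysis-side contract described in the commit above can be sketched as follows: each JSONL row carries the full arena_env_args dict, and the analysis side picks the factor columns by name. This is a minimal sketch under assumptions. The `outcomes` key, the example values, and the in-memory `factor_names` list (standing in for what factors.yaml declares) are invented for illustration; only the `arena_env_args` field name comes from the PR.

```python
import io
import json

# Two made-up rows in the assumed layout: full arena_env_args plus task outcomes.
jsonl = io.StringIO(
    '{"arena_env_args": {"light_intensity": 500.0, "object": "cube", "num_envs": 2},'
    ' "outcomes": {"success": 1.0}}\n'
    '{"arena_env_args": {"light_intensity": 4000.0, "object": "cube", "num_envs": 2},'
    ' "outcomes": {"success": 0.0}}\n'
)

# Stand-in for the factor names a factors.yaml would declare.
factor_names = ["light_intensity"]

theta, x = [], []
for line in jsonl:
    row = json.loads(line)
    # Only the declared factors become theta columns; other args are ignored.
    theta.append([row["arena_env_args"][name] for name in factor_names])
    x.append([row["outcomes"]["success"]])

print(theta)
print(x)
```

Because the writer logs every argument, studying a different factor later only means changing the declared names, not re-running the sweep.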
Paired (factors.yaml, jobs_config.json) sets for pi0 on the droid
pick_and_place_maple_table task:

* light_intensity_sweep — single continuous factor (light intensity)
* pick_up_object_sweep — single categorical factor (object identity)
* multi_factor_overnight_sweep — light_intensity (log-uniform) x 5 objects,
  num_episodes=4
* two_object_shiny_matte_sweep — focused 2-object contrast (matte mustard
  vs specular soup can) x log-uniform light, num_episodes=2

factors.yaml declares each factor's type/range/distribution for the analyzer;
jobs configs are consumed by eval_runner --episode_summary. Use --chunk_size
for the long sweeps to avoid host-RSS OOM.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
cvolk force-pushed the cvolk/feature/sensitivity_analysis_mvp1 branch from 74585f1 to 48fba5d on June 3, 2026 15:38
main's metrics refactor (#733) made cfg.metrics a MetricsCfg configclass
(one field per metric) rather than an iterable of metric objects. Iterate
its fields and use compute_metric_func/recorder_term_name/params, matching
MetricsManager. Fixes 'MetricsCfg object is not iterable' that produced
empty episode-summary JSONL.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
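
The fix described in the commit above (walking a config object's fields instead of iterating the object) can be sketched with a plain dataclass standing in for the configclass. The attribute names compute_metric_func, recorder_term_name, and params follow the commit text. `MetricCfg`, `success_rate`, `iter_metrics`, and the recorded data are hypothetical, and the real MetricsCfg may be structured differently.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Callable, Dict


@dataclass
class MetricCfg:
    """Hypothetical per-metric entry; attribute names follow the commit text."""
    compute_metric_func: Callable[..., float]
    recorder_term_name: str
    params: Dict[str, Any] = field(default_factory=dict)


def success_rate(values, threshold=0.5):
    """Fraction of recorded values at or above the threshold."""
    return sum(v >= threshold for v in values) / len(values)


@dataclass
class MetricsCfg:
    """One field per metric, so the object itself is not iterable."""
    success: MetricCfg = field(
        default_factory=lambda: MetricCfg(success_rate, "success", {"threshold": 0.5})
    )


def iter_metrics(cfg):
    """Yield (field_name, metric_cfg) by walking the dataclass fields."""
    for f in fields(cfg):
        yield f.name, getattr(cfg, f.name)


recorded = {"success": [1.0, 0.0, 1.0, 1.0]}  # made-up recorder output
cfg = MetricsCfg()
results = {
    name: metric.compute_metric_func(recorded[metric.recorder_term_name], **metric.params)
    for name, metric in iter_metrics(cfg)
}
print(results)
```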
sbi logs TensorBoard training curves under <cwd>/sbi-logs by default
(get_log_root hardcodes the cwd), so fitting raised PermissionError when the
cwd wasn't writable — e.g. generating a report from a repo checkout in a
non-root container. A one-shot report fit never reads those curves, so pass a
no-op tracker (_NullTracker) that discards them: no files written, no hidden
cwd dependency, runs from any directory. Centralize the tracker on the base by
having subclasses name their sbi class via _inference_cls.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Grow the two-object light sweep to 500 rubiks_cube + 500 alphabet_soup_can
(num_envs=2, num_episodes=2) for an overnight run with denser log-uniform
light sampling. Update the factors.yaml header to match.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Simplify the deliverable to one PDF on disk showing the most important plots
(robolab-style): an outcome x factor grid of marginal-posterior plots, fit one
analyzer per outcome. Drop the involved Plotly HTML report (report.py +
its CLI) — to be reintroduced in a follow-up PR.

- plotting.py: split the renderers into draw_marginal(ax, ...) that draws onto
  a caller-supplied Axes; plot_marginal keeps its single-figure save behavior.
- pdf_report.py: new generate_pdf_report() lays out the grid and saves one PDF.
- generate_sensitivity_report.py: now drives the PDF (--output_pdf).

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Split the report title across two lines (report+episodes / slice) and widen the
top margin so it doesn't clip when the grid is a single column.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Scope PR #729 to the MVP: sensitivity to a single light_intensity factor on one
object. Remove the multi-factor (multi_factor_overnight) and multi-object
(two_object_shiny_matte, pick_up_object) sweep configs — including two 22k-line
job configs — preserved on branch cvolk/feature/sensitivity_large_sweep_configs
for the larger overnight runs. Keep only the single-factor light_intensity
configs. Also drop the stale --factor_keys reference (no such flag; the writer
logs the full arena_env_args).

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Collapse the multi-line explanatory comment blocks to 1-2 dense lines, keeping
the non-obvious 'why' (log-space transforms, the step-count/len gotcha, the
deferred pxr import) and dropping the over-explanation. No code changes.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Drop log_uniform from the first PR: the FactorSpec.distribution field, the YAML
parsing/validation, the log10 theta transform + log-space prior in dataset, the
log-grid handling in NPE/KDE marginals, and the log x-axis in plotting. The MVP
sweeps linearly; analyzers and plotting are unchanged for linear factors. The
full log_uniform implementation is preserved on branch
cvolk/feature/sensitivity_log_uniform for a follow-up PR.

Verified post-removal: KDE, MNPE+categorical, and NPE all fit and render.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
cvolk changed the title from "Sensitivity analysis MVP: per-episode recording + NPE / MNPE / empirical analyzers" to "Sensitivity analysis MVP: per-episode recording, analyzers, single-PDF report" on Jun 5, 2026
…tracker

- Remove the verbose module-level docstrings across the sensitivity package; the
  two synthetic-data generators and the two CLI scripts now pass a short literal
  argparse description instead of `description=__doc__`.
- Remove the `_NullTracker` workaround. With in-container runs no longer executing
  as root, sbi's default tracker writes a (gitignored) `sbi-logs/` owned by the
  user, so the PermissionError it guarded against no longer occurs.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Comment thread isaaclab_arena/analysis/sensitivity/analyzer.py Outdated
Comment thread isaaclab_arena/analysis/sensitivity/__init__.py
cvolk added 10 commits June 8, 2026 14:16
…nalyzers

- Make `EmpiricalAnalyzer` an abstract base for the two direct (non-neural) analyzers,
  which estimate the posterior straight from data under a uniform prior. Rename the
  concrete categorical analyzer to `FrequencyTableAnalyzer` and reparent `KDEAnalyzer`
  beneath the same base. Both now share a named `SUCCESS_THRESHOLD` and a `_success_mask()`
  helper instead of inlining the `>= 0.5` success test.
- Update `make_analyzer` dispatch and the plotting/docstring references; fix a stale claim
  that only `PosteriorAnalyzer` provides `continuous_marginal_density` (KDEAnalyzer does too).
- Drop `v0.3`/`MVP-1` wording from the analyzer docstrings, keeping the substantive
  uniform-prior and binary-outcome assumptions.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
- Break the monolithic analyzer.py into analyzer_base, posterior_analyzer, and
  empirical_analyzer modules along the neural/empirical family seam.
- Move the make_analyzer dispatch into factory.py, re-exported from the package
  __init__; lazy concrete imports keep package import free of torch/sbi so the
  eval-time episode_writer path stays light.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
They're standalone smoke-test tools — not part of the runtime pipeline — so they
don't belong in the production analysis namespace. Relocated to the test-helper
package, ready to back a sim-free analyzer regression test.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Relocate analyze_sensitivity.py / generate_sensitivity_report.py from scripts/ into
analysis/sensitivity/ as analyze.py / generate_report.py, mirroring how
eval_runner/policy_runner live flat inside the evaluation package. Drops the
redundant "sensitivity" prefix now that the package name carries it.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
- Ship one analyzer per family — KDEAnalyzer (empirical) and MNPEAnalyzer (the sbi
  robolab port) — keeping the reviewable surface small while still demonstrating the
  multi-analyzer design across both the empirical and neural families.
- Park NPEAnalyzer, FrequencyTableAnalyzer, and the now-orphaned categorical synthetic
  data generator on cvolk/feature/sensitivity_deferred_analyzers to bring in later.
- Guard the deferred factor mixes in make_analyzer with clear asserts pointing at that
  branch: pure-categorical → FrequencyTable; multi-continuous or non-binary → NPE.
- Keep the PosteriorAnalyzer/EmpiricalAnalyzer family bases as the extension points the
  parked siblings re-attach to; drop the now-unused binary-outcome warn hook.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Drop git-branch references and the "robolab" attribution from the make_analyzer
asserts and the analyzer docstrings; state the unsupported factor mixes plainly.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
episode_writer is eval-time data production — it runs inside the eval loop, depends on
the metrics/evaluation machinery, and is called only by eval_runner. It has no coupling
to the analysis code (the analyzer consumes the JSONL purely as a format). Relocating it
beside its caller frees the sensitivity package of any pxr/sim dependency.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
The analyze.py CLI rendered one (factor, outcome) marginal to a PNG — a strict subset of
the outcome × factor grid that generate_report already produces. Remove it along with the
now-unused plot_marginal/_plot_title/_save_figure helpers, leaving draw_marginal (used by
the PDF report) as the single rendering path.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
task_duration was a synthetic-only continuous outcome (no matching registered metric) that
existed to exercise NPE. With NPE deferred, a single continuous factor with a non-binary
outcome now asserts, so emitting it made generate_report crash on the smoke dataset. The
generator now emits only the binary success/object-moved outcomes the MVP analyzes.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Address PR review: change the copyright headers on the sensitivity package files
(plus the moved episode_writer and the synthetic generator) from 2025-2026 to 2026,
matching the --use-current-year convention used by new files in the repo.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Comment thread docs/pages/concepts/policy/concept_sensitivity_analysis.rst
Comment thread docs/pages/concepts/policy/concept_sensitivity_analysis.rst Outdated
- move the 'why a joint posterior' section above 'How it works' so the motivation
  for simulation-based inference (MNPE/NPE) comes first
- drop 'hand-authored' from the factors.yaml description (it will be auto-generated)
- remove the eval-pipeline cross-reference sentence and the 'subset of keys' paragraph
- tighten the recording / inference / report wording
- add a TODO to include a sample report figure in 'Reading the output'

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
- move synthetic_sensitivity.py (tests/utils) -> analysis/sensitivity/synthetic.py so the
  synthetic example ships with the toolbox and the demo lives in the core module
  (run: python -m isaaclab_arena.analysis.sensitivity.synthetic --kind {mixed,continuous,rich});
  the recovery test now imports it from the package (tests -> package direction)
- dedupe the three make_* builders via _sample_categorical and _build_dataset helpers,
  keeping each fixture's ground-truth logit explicit and the RNG draw order unchanged
- update the docs demo commands to the new module path

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
plot_marginals now takes (samples, dataset, observation) and is a pure renderer — it no
longer runs inference. Callers sample via analyzer.sample_posterior and pass the draws in.
Plotting depends on the dataset (schema/layout), not on the inference object.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Replace the parallel RANGE/WEIGHT constants and the _normalized/_sample_uniform/
_sample_categorical helpers with _ContinuousFactor / _CategoricalFactor dataclasses that
each know how to sample, contribute their success-logit term, and emit their FactorSpec.
The 10 flat constants become 6 self-describing factor instances with signed weights, and
the make_* builders shrink to: sample factors, sum their logits, build. RNG draw order is
preserved, so seeded datasets are unchanged.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
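
The refactor described in the commit above (factor objects that each sample themselves and contribute a signed success-logit term) might look roughly like the sketch below. It is an illustration under assumptions, not the code in synthetic.py: the class names drop the leading underscore, the logit form (weight times the centered, range-normalized value) is made up, and FactorSpec emission is omitted.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ContinuousFactor:
    name: str
    range: Tuple[float, float]
    weight: float  # signed: positive means larger values help success

    def sample(self, rng: random.Random) -> float:
        return rng.uniform(*self.range)

    def logit(self, value: float) -> float:
        low, high = self.range
        centered = (value - low) / (high - low) - 0.5  # in [-0.5, 0.5]
        return self.weight * centered


@dataclass(frozen=True)
class CategoricalFactor:
    name: str
    weights: Dict[str, float]  # per-choice logit contribution

    def sample(self, rng: random.Random) -> str:
        return rng.choice(sorted(self.weights))

    def logit(self, value: str) -> float:
        return self.weights[value]


def simulate_episode(factors, rng):
    """Sample every factor, sum the logit terms, then draw a binary success."""
    values = {f.name: f.sample(rng) for f in factors}
    total = sum(f.logit(values[f.name]) for f in factors)
    p_success = 1.0 / (1.0 + math.exp(-total))
    return values, float(rng.random() < p_success)


factors = [
    ContinuousFactor("light_intensity", (100.0, 5000.0), weight=-4.0),
    CategoricalFactor("object", {"cube": 1.0, "can": -1.0}),
]
rng = random.Random(0)
episodes = [simulate_episode(factors, rng) for _ in range(5)]
for values, success in episodes:
    print(values, success)
```

Sampling the factors in a fixed list order from a single seeded generator is what keeps seeded datasets reproducible across refactors of this kind.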
The default observation (condition on success) is derived purely from the dataset's
outcomes and never touches the posterior, so it belongs on SensitivityDataset, not the
inference object. sample_posterior now delegates to dataset.default_observation() and
callers read it off the dataset.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Parked on cvolk/feature/sensitivity_eval_configs_parked; they are experiment-specific
inputs, not part of the toolbox.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
The slice (policy/task/embodiment) was a required, hand-authored block used only as a
report-title caption — never validated against the rows or used in inference. Remove
SliceSpec, the slice schema field/validation, and the title line. factors.yaml now declares
just factors + outcomes. Left a TODO for a robolab-style filter (select one
policy/task/embodiment slice from a larger episode_summary.jsonl) — the operation the slice
block was gesturing at without performing.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Factors carry real schema (continuous/categorical, range/choices) needed to rebuild theta
and the prior, so they stay in factors.yaml. Outcomes carry no structure — always read as
float, just a name — and *which* outcome to condition on is a query, like --observation. So:

- drop the outcomes block from factors.yaml and the OutcomeSpec class (the type field was
  dead — the loader treated every outcome as float)
- the outcome name is now a generate_report --outcome flag (default: success); the dataset
  carries outcome_names (default ('success',)) for x-column labels
- remove the now-dead outcome_columns property
- docs corrected: factors.yaml declares only factors; outcome is chosen at analysis time

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Drop the /eval/ ignore added earlier; leave .gitignore as on main.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Keep #729 focused on the analysis toolbox (analysis/sensitivity + tests + docs), which is
CPU-only and validated on synthetic data. The eval-side recording that produces
episode_summary.jsonl — episode_writer.py plus the eval_runner/CLI/job_manager/metrics
hooks — moves to a stacked follow-up PR where it can land with proper simulation tests.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Comment thread docs/pages/concepts/policy/concept_sensitivity_analysis.rst
cvolk marked this pull request as ready for review June 11, 2026 14:04
@greptile-apps

This comment was marked as outdated.

Comment thread isaaclab_arena/analysis/sensitivity/generate_report.py
Comment thread isaaclab_arena/analysis/sensitivity/analyzer.py
Comment thread isaaclab_arena/analysis/sensitivity/plotting.py
These are imported at module level by analysis/sensitivity, so they belong in
install_requires, not the [dev] extra — otherwise importing the package fails without
[dev]. The Docker image already installs via pip install -e .[dev], which pulls
install_requires too, so the image is unchanged; [dev] now holds only dev tools.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
nargs='*' let '--observation' be passed with zero values, producing an empty tensor that
reaches sbi as a cryptic shape mismatch. nargs='+' makes argparse reject it clearly;
omitting the flag still falls back to the default observation (success).

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
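
The nargs behavior described in the commit above can be reproduced with a few lines of standard argparse. This is a standalone sketch: only the --observation flag name comes from the PR, while `type=float` and `default=None` are assumptions about how the real CLI declares it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="nargs demo")
    # nargs="+" requires at least one value; default=None marks "flag omitted".
    parser.add_argument("--observation", type=float, nargs="+", default=None)
    return parser


parser = build_parser()

# Flag omitted: the caller can fall back to a default observation.
omitted = parser.parse_args([]).observation
# Flag with values: parsed into a list of floats.
given = parser.parse_args(["--observation", "1.0", "0.0"]).observation

try:
    parser.parse_args(["--observation"])
    rejected = False
except SystemExit:
    # argparse prints a usage error and exits when the flag has no values.
    rejected = True

print(omitted, given, rejected)
```

With nargs='*' the last call would instead succeed and yield an empty list, which is the case that previously reached sbi as an empty tensor.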
Comment thread isaaclab_arena/analysis/sensitivity/generate_report.py
Show only the 'rich' run (3 continuous + 2 categorical) and mention the other --kind
values in a line, instead of three near-identical commands.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Comment on lines +38 to +39
self._continuous_low = torch.tensor([factor.range[0][0] for factor in continuous_factors])
self._continuous_high = torch.tensor([factor.range[0][1] for factor in continuous_factors])

Contributor


P1 factor.range indexed without a None guard in __init__

Both list comprehensions on lines 38–39 call factor.range[0][0] and factor.range[0][1] directly. FactorSpec.range defaults to None, and the in-memory constructor path (SensitivityDataset(schema, theta, x)) has no assertion that continuous factors carry a range before SensitivityAnalyzer is constructed. If a user builds a schema in-memory with FactorSpec(name="x", type="continuous") (omitting range), the constructor crashes with TypeError: 'NoneType' object is not subscriptable rather than a legible assertion. The from_files path is safe because _infer_missing_factor_ranges fills the range first, but the in-memory path has no equivalent guard here.
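
A guard of the kind the comment asks for could be sketched as below. This is a hedged illustration, not a patch to the PR: `FactorSpec` here is a simplified stand-in whose range is a flat (low, high) pair (the reviewed code indexes it as range[0][0] and range[0][1], so the real layout may differ), and `continuous_bounds` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class FactorSpec:
    """Simplified stand-in; the real class's fields may differ."""
    name: str
    type: str
    range: Optional[Tuple[float, float]] = None


def continuous_bounds(factors: Sequence[FactorSpec]):
    """Collect (lows, highs), failing loudly if a continuous factor has no range."""
    continuous = [f for f in factors if f.type == "continuous"]
    for factor in continuous:
        assert factor.range is not None, (
            f"continuous factor '{factor.name}' has no range; "
            "set FactorSpec.range or load the schema via from_files"
        )
    lows = [f.range[0] for f in continuous]
    highs = [f.range[1] for f in continuous]
    return lows, highs


print(continuous_bounds([FactorSpec("x", "continuous", (0.0, 1.0))]))

try:
    continuous_bounds([FactorSpec("x", "continuous")])
except AssertionError as err:
    print(f"rejected: {err}")
```

The assertion names the offending factor, so an in-memory schema that omits a range fails with a readable message instead of a TypeError on None.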

'rich' wasn't descriptive and duplicated the MNPE case. Fold it into make_mixed_dataset
(now 3 continuous + 2 categorical — 'mixed' = mixed factor types) and drop make_rich_dataset.
make_continuous_dataset stays for the NPE path. Two builders, one per estimator, both named
for what they are. The MNPE test now asserts all five planted effects; --kind is {mixed, continuous}.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>