Rift o4d junior calmarg in loop: AFTER main 'distance' merge #139

Open
oshaughn wants to merge 115 commits into oshaughn:rift_O4d from oshaughnessy-junior:rift_O4d_junior_calmarg_in_loop


oshaughn commented Jun 1, 2026

Calibration marginalization (calmarg) is done inside the ILE loop, including

  • 'fused': a new kernel, written specifically to run fast on GPUs
  • 'loop': a brute-force backtest that loops over calibration realizations

plus supporting tools:

  • cal-pilot: adaptive sampling in the calibration parameters, to enable sane results at high SNR

Richard O'Shaughnessy and others added 30 commits May 4, 2026 07:20
Add RIFT.precision and route all extended-precision dtype use through it.

* RIFT.precision (new): RiftFloat resolves to numpy.longdouble whenever
  the platform's long double has itemsize > 8 (e.g. Linux x86_64), and
  otherwise falls back to numpy.float64. Also exports
  RIFT_FLOAT_HIGH_PRECISION and RIFT_FLOAT_NAME. Eliminates the
  import-time AttributeError when numpy.float128 is absent (macOS arm64,
  Windows MSVC, non-x86 Linux, future numpy 2.x platforms).

* Integrator package: replace every numpy.float128 / np.float128 in
    mcsampler.py, mcsamplerEnsemble.py, mcsamplerGPU.py,
    mcsamplerAdaptiveVolume.py, mcsamplerNFlow.py, mcsamplerPortfolio.py,
    statutils.py
  with RiftFloat. The dtype-equality guards in mcsamplerGPU and
  mcsamplerNFlow ("if weights_alt.dtype == numpy.float128: cast to
  float64") degrade gracefully when RiftFloat == float64 (the
  conditional astype becomes a no-op).

* likelihood/factored_likelihood.py and
  interpolators/BayesianLeastSquares.py: same RiftFloat swap, so they
  also import cleanly on platforms without np.float128.

* CI (.github/workflows/ci.yml): add rift_O4d_gmm_gpu to the trigger
  branches; expand the install matrix to 3.9-3.13; convert
  import-check and test-run into a two-lane matrix:
    - legacy : python 3.9  + numpy==1.24.4   (historical green build)
    - modern : python 3.12 + numpy>=2.0,<3.0 (forward-looking gate)
  NumPy is pinned after requirements.txt is installed, so the unpinned
  'numpy' line in requirements.txt is left untouched. Test-log artifacts
  are named per lane so failures from each can be uploaded independently.

No behavioral change on the existing legacy CI lane: RiftFloat is
numpy.longdouble there, which is the exact 16-byte type previously
spelled numpy.float128. The modern CI lane is the new gate.
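A minimal sketch of the dtype resolution described in the RIFT.precision bullet, assuming only NumPy. The constant names mirror the commit message, but this is an illustration, not the contents of RIFT/precision.py; `to_float64_if_extended` is a hypothetical stand-in for the dtype-equality guards.

```python
import numpy as np

# Use numpy.longdouble only when it is genuinely wider than float64
# (e.g. 16 bytes on Linux x86_64); otherwise fall back to float64.
if np.dtype(np.longdouble).itemsize > 8:
    RiftFloat = np.longdouble
else:
    RiftFloat = np.float64

RIFT_FLOAT_HIGH_PRECISION = np.dtype(RiftFloat).itemsize > 8
RIFT_FLOAT_NAME = np.dtype(RiftFloat).name


def to_float64_if_extended(weights):
    """Hypothetical guard: cast RiftFloat arrays to float64. When RiftFloat
    is float64 the cast leaves the values unchanged."""
    if weights.dtype == RiftFloat:
        return weights.astype(np.float64)
    return weights
```

Nothing here names `numpy.float128`, so the module imports on every platform.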
Implement a reusable distance-grid export helper for ILE, thread the export flag through pseudo_pipe, and add focused reconstruction tests. Add a zero-spin fake-data demo that builds a DAG with distance-grid export enabled, plus a small lalsimutils XML compatibility fix for current LAL bindings.
Provide a root pixi workspace that defaults local development to SWIG <4.4.0 while also defining a SWIG >=4.4.0 comparison environment.

Add GitLab CI jobs for both pixi environments so deployment stability can be checked across the hidden SWIG binding change.
* RIFT/misc/distance_grid.py: build_distance_grid now divides out the
  distance sampling prior so the exported lnL is L_pure(d) = integral
  L(d,Omega) pi_Omega dOmega. New column ln_prior_d_sampling carries the
  per-bin sampling-prior factor so default reconstruction reproduces
  log_res exactly, while reconstruct_marginal_lnL(grid, ln_prior_d=...)
  re-marginalizes against any prior of choice.
* integrate_likelihood_extrinsic_batchmode: handle mcsamplerEnsemble/GMM
  _rvs columns (raw integrand/joint_prior/joint_s_prior, not log_*); drop
  zero-weight samples cleanly instead of raising "missing type". Pass
  sampler.prior_pdf["distance"] values per-sample.
* test/test_distance_grid.py: cover the pure-likelihood property
  (different priors yield correctly different marginals) and the round
  trip.
* demo/rift/add_distance_grids/validate_distance_grid.py: new stress
  harness quantifies n_eff vs integral/shape error.
* pixi.toml: pin lalsuite==7.25, lalmetaio<=4.0.5 to dodge the SWIG-4.4
  cross-module SwigPyObject/LIGOTimeGPS regression (issue oshaughn#136). Verified
  end-to-end ILE run produces .dgrid whose reconstruction matches log_res
  to machine precision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
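The "pure likelihood" idea can be illustrated on a 1-D grid. Because the sampling prior has been divided out, the grid can be re-marginalized against any distance prior. This is a sketch under simple assumptions: bin widths come from `np.gradient`, and the prior is normalized on the grid. `remarginalize_distance_sketch` is a hypothetical helper, not RIFT's `reconstruct_marginal_lnL`.

```python
import numpy as np


def _logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))


def remarginalize_distance_sketch(d_grid, lnL_pure, prior_pdf):
    """Marginalize a per-distance pure likelihood lnL_pure(d_k) over an
    arbitrary (unnormalized) distance prior evaluated on the grid:

        ln Z = ln sum_k L(d_k) pi(d_k) dd_k  -  ln sum_k pi(d_k) dd_k
    """
    d_grid = np.asarray(d_grid, dtype=float)
    widths = np.gradient(d_grid)                  # approximate bin widths
    ln_w = np.log(prior_pdf(d_grid)) + np.log(widths)
    return _logsumexp(np.asarray(lnL_pure) + ln_w) - _logsumexp(ln_w)
```

A flat-in-d likelihood returns itself under any prior. A peaked likelihood gives different marginals under, say, flat and d^2 priors, which is the pure-likelihood property the new test checks.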
ILE batchmode now optionally emits a .dslice file per intrinsic point
containing K independent extrinsic-marginalized likelihoods at K distance
slice centers (quantile centers of the posterior in d). The estimator is
importance-reweighting of the main run's Omega samples at each slice
distance, re-using the cached likelihood machinery -- no waveform or PSD
regeneration, no extra worker spin-up. With K ~ 10 the artifact stays
within the user's size budget of roughly 10x the .composite file.
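The per-slice reweighting estimate amounts to an importance-sampling average over the stored Omega samples, with the distance pinned at the slice center. This sketch assumes each sample carries its log prior and the log density it was drawn from. `reweight_slice_sketch` is hypothetical and is not the signature of `importance_reweight_slices`.

```python
import numpy as np


def reweight_slice_sketch(lnL_at_slice, ln_prior, ln_sampling):
    """Importance-reweighting estimate at one slice distance d_k.

    lnL_at_slice : log-likelihood of each stored Omega sample, re-evaluated
                   with the distance pinned at d_k
    ln_prior     : log prior density of each Omega sample
    ln_sampling  : log density each sample was actually drawn from
    Returns (ln of the Omega-marginalized likelihood, effective sample size).
    """
    ln_terms = np.asarray(lnL_at_slice) + np.asarray(ln_prior) - np.asarray(ln_sampling)
    n = ln_terms.size
    m = np.max(ln_terms)
    w = np.exp(ln_terms - m)                     # rescaled importance weights
    ln_est = m + np.log(np.sum(w)) - np.log(n)   # ln[(1/N) sum_i L p / q]
    n_eff = np.sum(w) ** 2 / np.sum(w ** 2)
    return ln_est, n_eff
```

A small `n_eff` is the failure mode the GMM/low-n_eff warning targets. When the stored samples have little support at d_k, a few weights dominate and the estimate is biased.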

* RIFT/misc/distance_slices.py: importance_reweight_slices,
  quantile_slice_centers, table builder/loader, and
  reconstruct_marginal_lnL that takes an optional custom distance prior.
  Schema (DISTANCE_SLICE_FIELDS) deliberately mirrors .composite for
  downstream CIP integration.
* integrate_likelihood_extrinsic_batchmode: new --export-distance-slices
  K and --distance-slice-method flags; threaded into analyze_event after
  the main integration. Reuses sampler._rvs and like_to_integrate.
  Emits a runtime warning when GMM + low main n_eff, since B2-reweight
  silently biases in that regime.
* demo/rift/add_distance_grids/validate_distance_slices.py: synthetic
  stress harness with a known closed-form marginal and an adjustable
  d-Omega coupling. Confirms B2-reweight matches truth to <0.1 nat over
  a wide coupling range when main n_eff is healthy.
* demo/rift/add_distance_grids/PLAN_B_DESIGN.md: design notes covering
  the math, the GMM-vs-AV finding, the (recommended) non-destructive
  workflow integration plan, and the deferred B2-fresh cross-check path.

End-to-end check on the fake-data demo (AV sampler, main n_eff ~6):
B2 marginal reconstructs the main log_res within sigmaL; per-slice
n_eff 7-28; .dslice 10 rows x 19 columns per event.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reweight alone breaks in the tails: Omega samples drawn during the main
run have no support at distances far from the posterior peak, so the
slice estimator silently biases or returns garbage there. Switch to a
hybrid scheme where core slices stay reweight (cheap, accurate inside
the posterior) and wing slices are fresh Omega-only AdaptiveVolume
integrations at the pinned distance (correct, expensive only on the few
points we need them).

* RIFT/misc/distance_slices.py:
  - fresh_sample_slices builds a fresh AV sampler over Omega only, clones
    the main sampler's per-param (pdf, prior, llim, rlim) config, wraps
    like_to_integrate to pin distance and defensively clip Omega values
    inside [llim, rlim] (avoids arccos NaN at boundary).
  - pick_wing_centers places K_wing centers log-uniformly in
    [d_min, d_core_lo] union [d_core_hi, d_max], evenly split.
  - is_uninformative detects a flat-in-d core so wings are skipped on
    events where the distance posterior carries no information.
  - sigma_lnL conversion for fresh slices: AV returns log(rel_var) +
    2*log_int; we report sqrt(rel_var) so the column is on the same
    scale as the reweight branch and the main run's sigmaL_main.

* integrate_likelihood_extrinsic_batchmode: split --export-distance-slices
  K into --n-distance-slice-core (reweight) + --n-distance-slice-wing
  (fresh). Default 60/40 split. New flags --distance-slice-wing-nmax,
  --distance-slice-wing-neff, --distance-slice-skip-threshold.
  Per-row method column (reweight=0, fresh=1) marks which estimator
  produced each slice.

* PLAN_B_DESIGN.md: documents the architecture and the empirical wing
  reach on the demo event (~30 nats below peak with sigmaL ~0.1-0.2),
  well past the ~7-nat target that corresponds to ~10^{-3} probability outside.

End-to-end on the fake-data demo (AV, --n-distance-slice-core 6 --n-distance-slice-wing 4):
  main log_res 59.09 +/- 0.30, n_eff 4.4
  core slices: lnL 60.8-62.3 (peak-lnL 0-1.4 nat), sigmaL 0.19-0.55
  wing slices at 23/376/613 Mpc: lnL 59/51/34 (peak-lnL 3/11/29 nat),
    sigmaL 0.10-0.22, neff 9-24
  far wing at 5 Mpc: lnL -100 (signal off), correctly flagged low neff

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
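The log-uniform wing placement and the sigma_lnL conversion can be sketched as follows. Both function names are hypothetical. Placing centers at geometric bin midpoints, so that none coincides with a core edge, is an assumption of this sketch and not necessarily what `pick_wing_centers` does. The conversion relies on sigma(lnL) ~ sigma(I)/I to first order, which is sqrt(rel_var).

```python
import numpy as np


def log_uniform_wing_centers(d_min, d_core_lo, d_core_hi, d_max, k_wing):
    """Place k_wing centers log-uniformly in [d_min, d_core_lo] U
    [d_core_hi, d_max], split as evenly as possible between the two wings."""
    n_lo = k_wing // 2
    n_hi = k_wing - n_lo

    def centers(a, b, n):
        if n <= 0:
            return np.empty(0)
        edges = np.geomspace(a, b, n + 1)
        return np.sqrt(edges[:-1] * edges[1:])   # geometric bin midpoints

    return np.concatenate([centers(d_min, d_core_lo, n_lo),
                           centers(d_core_hi, d_max, n_hi)])


def sigma_lnL_from_log_variance(log_var, log_int):
    """Convert an AV-style log variance, log(rel_var) + 2*log_int, into
    sqrt(rel_var), i.e. the relative error used as sigma_lnL."""
    return np.exp(0.5 * (log_var - 2.0 * log_int))
```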
Two follow-ups left for the next session, called out as breadcrumbs
rather than implemented now:

1. The skip threshold should be on an absolute lnL scale.
   lnL is already a likelihood ratio with absolute meaning; the
   current relative spread test will misfire on high-SNR events
   whose distance posterior happens to be flat.

2. Wing centers from a parabolic-in-1/dist fit of the core, solved
   for the 1/dist values where lnL drops by ~7 nats from peak
   (probability outside ~10^{-3}). Marginalized-lnL caveat (the
   inclination-distance ridge can extend further toward small d
   than a simple parabola predicts) documented inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the two PLAN_B_DESIGN breadcrumbs:

1. is_uninformative now applies an absolute lnL detectability cut (peak
   core lnL < threshold) instead of a relative max-min spread test. lnL
   is a likelihood ratio vs noise, so this correctly skips undetected
   low-SNR events while keeping high-SNR events with a flat distance
   profile. ILE skip message and --distance-slice-skip-threshold help
   updated.

2. pick_wing_centers fits the core (lnL, 1/d) points to a parabola in
   1/d (fit_lnL_parabola_in_inv_d) and spans each wing from the core
   edge out to where the model drops --distance-slice-wing-delta-lnL
   nats below peak (default 7), via _parabolic_wing_bounds. Bounds are
   clamped to the sampler's distance support; degenerate fits fall back
   to the original log-uniform full-range placement. New ILE flag
   --distance-slice-wing-delta-lnL threads the target.

Adds a regression test (test_wing_placement_and_skip) to
validate_distance_slices.py; verified end-to-end on the fake-data demo
(AV sampler): wings concentrate near the core instead of the prior
edges, and reconstruct_marginal_lnL matches log_res within sigmaL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
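The parabolic wing bounds reduce to a quadratic in x = 1/d. Fit lnL ~ a x^2 + b x + c. For a < 0 the peak is at x* = -b/(2a), and the model drops by delta_lnL at x* +/- sqrt(delta_lnL / -a). The sketch below uses hypothetical `_sketch` names. It clamps to the supplied support and returns None for a degenerate fit, where the log-uniform fallback would apply. It also includes the absolute detectability cut.

```python
import numpy as np


def fit_lnL_parabola_in_inv_d_sketch(d, lnL):
    """Least-squares fit lnL ~ a x^2 + b x + c with x = 1/d."""
    x = 1.0 / np.asarray(d, dtype=float)
    a, b, c = np.polyfit(x, np.asarray(lnL, dtype=float), 2)
    return a, b, c


def parabolic_wing_bounds_sketch(a, b, c, delta_lnL, d_support):
    """Distances where the fitted model falls delta_lnL below its peak,
    clamped to d_support = (d_min, d_max). Returns None if the fit has no
    interior maximum (a >= 0)."""
    if not a < 0:
        return None
    x_peak = -b / (2.0 * a)
    half_width = np.sqrt(delta_lnL / -a)
    x_near, x_far = x_peak + half_width, x_peak - half_width   # larger x = smaller d
    d_min, d_max = d_support
    d_near = 1.0 / x_near if x_near > 0 else d_min
    d_far = 1.0 / x_far if x_far > 0 else d_max
    return max(d_near, d_min), min(d_far, d_max)


def is_uninformative_sketch(core_lnL, threshold):
    """Absolute detectability cut: skip wings if the peak core lnL is low."""
    return np.max(core_lnL) < threshold
```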
…ine builder

Threads per-distance likelihood export from util_RIFT_pseudo_pipe.py
through create_event_parameter_pipeline_* onto the ILE extrinsic stage
(ILE_extr.sub), with an end-to-end pipeline-build test/demo.

CEPP (Basic/Alternate/BasicMultiApprox):
 - New flags --last-iteration-export-distance-slices K plus
   -n-core/-n-wing/-wing-delta-lnL/-skip-threshold passthroughs. When
   set, the extrinsic stage gets --export-distance-slices K (+ the
   tunables + --internal-use-lnL) and --distance-marginalization is
   stripped, mirroring the existing grid export.
 - AlternateIteration and BasicMultiApproxIteration previously lacked
   the grid flag entirely; added both the grid and slice args + the
   ile_args_extr handling so the subdags/multi-approx CEPP variants
   accept the flags pseudo_pipe now routes to them.

util_RIFT_pseudo_pipe.py:
 - New --export-distance-slices K (+ tunables), sibling to
   --export-marginal-distance-grid.
 - When either export is requested: force ILE lnL mode, disable
   distance marginalization (sane auto-config instead of erroring), and
   warn if --add-extrinsic is absent (the export is emitted there).
 - Fix: the --last-iteration-export-* flags are pipeline-builder flags,
   not ILE flags. They were being appended to args_ile.txt (the ILE
   argument string), where they would have been passed to the ILE
   executable and rejected. Move them to the CEPP command; keep only the
   ILE-side hygiene (lnL mode, no distance marginalization) in args_ile.

Make the three create_event_parameter_pipeline_* scripts executable
(100644 -> 100755), matching their sibling bin scripts: pseudo_pipe
invokes them by bare name, so editable/source/pixi installs need +x.

Validation:
 - New demo MonteCarloMarginalizeCode/Code/demo/pipeline (Makefile +
   README): builds baseline/grid/slices pipelines from the reference
   ini and asserts the flags land in ILE_extr.sub (and not in the
   intrinsic ILE.sub), with no distance marginalization. All pass.
 - Expanded .travis/test-build.sh with the same grid + slice build
   assertions (run in GitLab and GitHub CI).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… done in PLAN_B_DESIGN

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ic stage

Fix the per-distance export threading so distance marginalization is kept
for the intrinsic ILE iterations (a large speedup) and removed ONLY on the
final extrinsic stage that emits the per-distance output.

Previously util_RIFT_pseudo_pipe.py set opts.internal_marginalize_distance
= False and stripped --distance-marginalization from args_ile.txt, which
disabled it for every ILE job in every iteration. Now pseudo_pipe only
forces ILE lnL mode globally (clean lnL-scaled helper args) and leaves
distance marginalization in place; create_event_parameter_pipeline_* already
strips the standalone --distance-marginalization flag from the ILE_extr
argument string, so the disable is confined to the export stage.

Make util_InitMargTable executable (100644 -> 100755): the helper invokes it
at build time to generate the distance-marginalization lookup table, which is
now needed again because the intrinsic stage keeps distance marginalization.

Validation updated to prove the last-stage-only invariant: demo/pipeline and
.travis/test-build.sh now assert the standalone --distance-marginalization
flag is present on the intrinsic ILE.sub / args_ile.txt but absent from
ILE_extr.sub (matching the standalone flag via a trailing space, so the
harmless leftover --distance-marginalization-lookup-table arg is not counted).
All three demo targets pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
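The commit matches the standalone flag by its trailing space. A slightly stricter whole-token match can be sketched with a regex, swapped in here for illustration; it is not what create_event_parameter_pipeline_* actually does. Both helper names are hypothetical. The whitespace tidy via `split()` assumes no quoted arguments contain spaces.

```python
import re

# Match the flag only as a whole token: preceded by start/whitespace and
# followed by whitespace/end, so --distance-marginalization-lookup-table
# (which continues with '-') is never touched.
_STANDALONE_DMARG = re.compile(r"(?<!\S)--distance-marginalization(?!\S)")


def strip_distance_marginalization(arg_string):
    """Remove the standalone flag from an ILE argument string and collapse
    the leftover whitespace."""
    stripped = _STANDALONE_DMARG.sub("", arg_string)
    return " ".join(stripped.split())


def has_standalone_dmarg(arg_string):
    """True if the standalone flag is present (lookup-table arg ignored)."""
    return _STANDALONE_DMARG.search(arg_string) is not None
```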
…n is disabled only at the extrinsic stage

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…validation demo

Adds the consolidation step the previous threading work was missing, plus a
self-contained zero-spin IMRPhenomD demo that runs the whole chain (pipeline
build -> ILE_extr -> consolidate -> posterior) end-to-end without condor.

Pipeline:
 - New util_ConsolidateDistanceGrids.py: concatenates per-event .dgrid/.dslice
   files (header-checked) into a single net intrinsic+distance table.
 - New write_consolidate_distance_grids_sub in RIFT.misc.dag_utils_generic:
   mirrors write_cat_sub (extrinsic posterior samples) so the consolidation
   plugs into the same post-extrinsic part of the DAG.
 - create_event_parameter_pipeline_BasicIteration: when the last-iteration
   per-distance export is on, emit consolidate_dgrid.sub /
   consolidate_dslice.sub gated on the corresponding flag, build a DAG node,
   and chain it as a child of every ILE_extr job (.dgrid / .dslice come
   directly off ILE with no convert/resample step, so the consolidation node
   parents the ILE_extr nodes, not the cat_node downstream).
   Output: all_dgrid.dat / all_dslice.dat at the run root.

Demo + validation:
 - demo/pipeline/Makefile: updated grid/slices assertions to require the
   consolidation sub-file and DAG references.
 - demo/pipeline/zero_spin_phenomD/: new end-to-end test. Uses the
   .travis/ILE-GPU-Paper zero-noise BBH fake data, IMRPhenomD with
   --assume-nospin, AV sampler. Steps:
     build       -> util_RIFT_pseudo_pipe.py constructs the pipeline and
                    asserts the extrinsic stage carries grid export + AV +
                    IMRPhenomD + lnL mode, distance marg only off at the
                    extrinsic stage, consolidate_dgrid in the DAG.
     run-extr    -> bypass condor; invoke ILE_extr directly on N_EVENTS
                    grid rows -> per-event .dgrid files.
     consolidate -> util_ConsolidateDistanceGrids.py -> all_dgrid.dat.
     posterior   -> util_ConstructEOSPosterior.py with --parameter m1 -m2 -dist
                    reconstructs the joint (intrinsic+distance) posterior.
   The whole chain runs in ~45 s on one laptop core. Ships a minimal zero_spin_phenomD.ini
   whose [rift-pseudo-pipe] section deliberately omits approx /
   ile-sampler-method so the CLI overrides win (the ini section parser
   otherwise overrides the command line).

Drive-by fix needed for the demo:
 - util_ConstructEOSPosterior.py had CRLF line endings, breaking its
   /usr/bin/env shebang ("python\r" not found). Converted to LF.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
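Header-checked concatenation, as util_ConsolidateDistanceGrids.py is described to do, can be sketched in plain Python. This assumes each per-event file has a single header line followed by data rows. The actual .dgrid/.dslice header format and the script's behavior on empty files may differ, and `consolidate_tables_sketch` is hypothetical.

```python
def consolidate_tables_sketch(paths, out_path):
    """Concatenate per-event tables into one file. Every input must share
    the first file's header line; data lines are appended in order.
    Returns the number of data rows written."""
    header = None
    n_rows = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                lines = f.read().splitlines()
            if not lines:
                continue                         # skip empty per-event files
            if header is None:
                header = lines[0]
                out.write(header + "\n")
            elif lines[0] != header:
                raise ValueError(f"header mismatch in {path}")
            for line in lines[1:]:
                if line.strip():
                    out.write(line + "\n")
                    n_rows += 1
    return n_rows
```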
… in PLAN_B_DESIGN

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…--pipeline-builder hot-swap

The --use-subdags path (create_event_parameter_pipeline_AlternateIteration)
was broken for normal runs by a chain of issues, each masking the next:

- cip_args_list parsing crashed on the 'Z'/'G' prefixes emitted by
  util_RIFT_pseudo_pipe.py (ValueError: invalid literal for int() ... 'Z').
  Ported BasicIteration's tolerant prefix parsing.
- argparse rejected 8 options the helper passes (extrinsic samples-per-ile,
  time-resampling, batched-convert, ile-request-disk, cip-explode-jobs-dag/-last,
  n-iterations-subdag-max). Added them with BasicIteration's signatures; wired
  the ones with a clean home, documented the rest as accepted-but-not-acted-on.
- completed the half-built extrinsic batched/time-resampling convert path
  (batchConvertExtr_job was referenced but never defined) by porting
  BasicIteration's 3-branch convert setup + node construction.
- fixed undefined unify_node_list used to attach SCRIPT POST composite checks.

AlternateIteration now builds a complete DAG end-to-end from a standard
pseudo_pipe invocation.

Apply the same int('Z') tolerant-parsing fix to BasicMultiApproxIteration,
which had the identical crash.

Thread AlternateIteration into util_RIFT_pseudo_pipe.py as a first-class
drop-in via a new --pipeline-builder {BasicIteration,AlternateIteration}
selector that overrides the implicit --use-subdags routing, enabling
side-by-side A/B testing of the two builders from an otherwise identical
command line. Warns if an explicit choice contradicts an AMR/subdag
requirement.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
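The --pipeline-builder override can be sketched with argparse. An explicit choice wins over the implicit --use-subdags routing, and a warning is emitted when the two disagree. This is a hypothetical reduction: only the --use-subdags case is modelled (not the AMR requirement), and the returned script name just concatenates the builder name onto the existing prefix.

```python
import argparse
import warnings

BUILDERS = ("BasicIteration", "AlternateIteration")


def choose_pipeline_builder(argv=None):
    """Return the pipeline-builder script name selected by the arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--pipeline-builder", choices=BUILDERS, default=None)
    parser.add_argument("--use-subdags", action="store_true")
    opts = parser.parse_args(argv)

    implicit = "AlternateIteration" if opts.use_subdags else "BasicIteration"
    if opts.pipeline_builder is None:
        choice = implicit
    else:
        choice = opts.pipeline_builder
        if opts.use_subdags and choice != implicit:
            warnings.warn("--pipeline-builder %s overrides --use-subdags" % choice)
    return "create_event_parameter_pipeline_" + choice
```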
The mcsamplerEnsemble GPU port (and MonteCarloEnsemble / gaussian_mixture_model)
had never been exercised with cupy installed and crashed in GPU mode. Validated
end-to-end with .travis/test-integrate.sh on a Kepler/sm_30 card using a cupy
10.6 + cudatoolkit 10.2 environment (last CUDA supporting sm_30).

mcsamplerEnsemble:
* evaluate()/calc_pdf(): bridge the host integrand/prior -- convert samples to
  CPU, call the user function, push the result back to the active backend.
* replace cupy-incompatible rot90([list]) with an order-preserving reshape.
* build dim-group / bounds dict keys with host ints (range/np.arange); the
  self.xpy.arange variant produced unhashable 0-d cupy arrays on GPU.
* return scalars and store _rvs on the host so downstream numpy code works.

gaussian_mixture_model / MonteCarloEnsemble:
* portable _xpy_logsumexp (cupyx.scipy.special.logsumexp is absent in the cupy
  CUDA 10.2 build needed for sm_30).
* _near_psd: use Hermitian eigh/eigvalsh on GPU (cupy.linalg has no eig/eigvals);
  the matrices are symmetric. numpy path unchanged.
* gpu_logpdf: cupy.linalg has no LinAlgError and cholesky returns NaN rather than
  raising; catch numpy's error type and treat a NaN factor as failure.

All integrators: import cupyx.scipy.special explicitly (not auto-loaded by
import cupyx in older cupy).

mcsamplerGPU / mcsamplerAdaptiveVolume (so the full test passes in GPU mode):
* use instance converters / self.xpy instead of module-level GPU converters,
  which were contaminating these otherwise-CPU samplers with cupy arrays;
* bridge their CPU prior/integrand; ones_like to follow the data backend.

No behavioral change without cupy: all GPU branches are guarded by cupy_ok.
Note: the AC sampler's --as-test check is statistically flaky (no seed) on both
CPU and GPU and can randomly fail; this is pre-existing and unrelated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
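The two backend-portable pieces can be sketched with an `xpy` module argument (numpy here; cupy is assumed to expose the same calls). The first is a logsumexp that avoids `cupyx.scipy.special`. The second is a near-PSD projection through the Hermitian solver `eigh`, which is valid because the matrices are symmetric. The function names and the eigenvalue floor are illustrative; RIFT's `_xpy_logsumexp` and `_near_psd` may differ in detail.

```python
import numpy as np


def xpy_logsumexp(a, axis=None, xpy=np):
    """Backend-agnostic logsumexp built only from elementary array calls."""
    a_max = xpy.max(a, axis=axis, keepdims=True)
    a_max = xpy.where(xpy.isfinite(a_max), a_max, 0.0)   # guard all -inf input
    s = xpy.sum(xpy.exp(a - a_max), axis=axis, keepdims=True)
    return xpy.squeeze(xpy.log(s) + a_max, axis=axis)


def near_psd(cov, eps=1e-10, xpy=np):
    """Project a (nearly) symmetric matrix onto the PSD cone via eigh."""
    sym = 0.5 * (cov + cov.T)                  # enforce exact symmetry
    vals, vecs = xpy.linalg.eigh(sym)
    vals = xpy.maximum(vals, eps)              # floor negative eigenvalues
    return (vecs * vals) @ vecs.T              # V diag(vals) V^T
```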
… dep canary

Three related pieces for container CI and flexible multi-architecture deployment:

Feature C (core): container family manifest. New RIFT/misc/container_manifest.py
parses a YAML manifest (advertising a family of images + GPU capability ranges)
and builds HTCondor expressions. Wired into write_ILE_sub_simple and
write_CIP_sub (dag_utils_generic.py, import-guarded): when SINGULARITY_RIFT_IMAGE
points at a .yaml/.yml manifest, MY.SingularityImage becomes an expression-valued
ifThenElse over the matched machine's GPU capability (default GPUs_Capability),
selecting the right image per machine. Only the matched image is fetched (CVMFS
images referenced in place / lazy-fetched; osdf images selectively transferred
via a comma-free $$() ternary token), never the whole family. A require_gpus
capability floor is &&-composed with any user RIFT_REQUIRE_GPUS. Plain .sif /
osdf:// values keep byte-identical legacy behavior. Vanilla universe throughout.

Feature B: multi-target build. New containers/ dir -- rift_container.def.in
template + build_family.sh (build matrix; first entry keeps the current
production base for broad compatibility), shared requirements-container.txt
(single source of truth), example rift_container_family.yaml, and README.

Feature A: CI dependency-resolution canary. New non-blocking container-dep-canary
and container-swig-canary jobs in ci.yml, plus a weekly schedule, to catch
upstream breakage (e.g. swig>=4.4.0, issue oshaughn#136) before a container rebuild.

Tests: MonteCarloMarginalizeCode/Code/test/test_container_manifest.py (13 tests:
parser, expression builders, integration via write_ILE_sub_simple condor_cmds,
all-cvmfs, backward-compat).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
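The expression-valued image selection can be sketched as a nested ClassAd `ifThenElse` over the machine's GPU capability, plus an &&-composed `require_gpus` floor. The (capability, image) pair format and both helper names are hypothetical, and the osdf `$$()` token path is not shown. Whether the attribute should be prefixed with `TARGET.` depends on where HTCondor evaluates `MY.SingularityImage`; treat that as an assumption.

```python
def image_selection_expression(images, attr="GPUs_Capability", fallback=""):
    """Build a nested ifThenElse that picks the image for the highest
    capability floor the matched machine satisfies.

    images: list of (min_capability, image_path) pairs.
    """
    expr = '"%s"' % fallback
    # Wrap in ascending order so the highest floor ends up outermost
    # and is therefore tested first.
    for cap, path in sorted(images, key=lambda item: item[0]):
        expr = 'ifThenElse(TARGET.%s >= %s, "%s", %s)' % (attr, cap, path, expr)
    return expr


def compose_require_gpus(user_expr, min_capability):
    """&&-compose a capability floor with any user-supplied require_gpus."""
    floor = "Capability >= %s" % min_capability
    if not user_expr:
        return floor
    return "(%s) && (%s)" % (user_expr, floor)
```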
Validated on a real HTCondor pool + GPU (cap-3.0 machine): GPUs_Capability /
Capability attribute names, require_gpus floor matching+exclusion, $$()
match-time image selection, and tolerance of the empty-result ("") case for a
mixed CVMFS/osdf manifest. Mixed manifests are safe; uniform retrieval is not
required. Only the OSG/GWMS pilot evaluation of the expression-valued
MY.SingularityImage remains to be smoke-tested on a real glidein.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
oshaughn and others added 30 commits June 1, 2026 18:11
…pipe (CI data)

Adds PP_PILOT toggle + `make pp-run-pilot` (and pp-run-pilot-build): pp-run with
--calmarg-pilot, so the full top-level pilot DAG (harvest->dump->fit->consolidate->seed
wide_{N+1}) runs on the CI fake data.  build-validate asserts CALPILOT.sub runs
util_CalPilotStage.py, the CALPILOT job is in the DAG, and the wide ILE args carry the
--calibration-proposal-breadcrumb seed.  Honours OSG/CIT like pp-run.  Verified the build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…(task oshaughn#23)

The CALPILOT job runs ILE internally, so on OSG it needs the same container + input set as
a wide ILE job.  write_calpilot_sub gains use_osg/use_singularity/frames_dir/transfer_files
(mirrors write_ILE_sub_simple):
  - runs in the singularity image (exe at SINGULARITY_BASE_EXE_DIR; transfer_executable
    False; MY.SingularityImage/BindCVMFS/flock_local; HAS_SINGULARITY requirement);
  - a calpilot_pre.sh prescript rebuilds local.cache (relative paths) from the transferred
    frames, then execs the stage;
  - transfer_input_files = transfer_files (PSD + cal envelopes, sans the wide grid) +
    frames_dir + composite + args_ile.txt; transfer_output_files = the consolidated
    breadcrumb; stage args reference BASENAMES (no shared FS), workdir '.'.
  - refinement (--prev-breadcrumb) is skipped on OSG (the prev breadcrumb is produced at
    runtime, can't be reliably listed for transfer at iteration 0) -> each OSG pilot is an
    independent cold start, which the prior-shrinkage fit makes safe.
create_event_parameter_pipeline_BasicIteration passes the OSG params + transfer_file_names
to the calpilot job.

ILE robustness: the wide-ILE breadcrumb seed load is now wrapped in try/except -> a
missing/partial/invalid breadcrumb (esp. under OSG file transfer) falls back to PRIOR cal
draws with a warning instead of killing the job.

DONE: the CALPILOT jobs RUN on OSG and produce cal_consolidated_N.npz (transferred back).
REMAINING (task oshaughn#23): consuming the seed on OSG -- transferring cal_consolidated_{N-1}.npz
to the wide_{N+1} ILE jobs (pseudo_pipe basename ref + the iteration-start-absent edge),
so the wide jobs use the learned proposal rather than always falling back to prior.
UNTESTED off-CIT: validate the container + transfer on a real OSG run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
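The robustness change follows a common pattern: a load that can never kill the job. The sketch assumes the breadcrumb is an .npz archive. `load_breadcrumb_or_none` is hypothetical, and the broad `except` is a deliberate choice for this sketch. Returning None signals the caller to draw calibration parameters from the prior instead.

```python
import warnings

import numpy as np


def load_breadcrumb_or_none(path):
    """Load an .npz breadcrumb into a dict, or return None (with a warning)
    if it is missing, empty, truncated, or otherwise unreadable."""
    try:
        with np.load(path, allow_pickle=False) as data:
            return {key: data[key] for key in data.files}
    except Exception as exc:  # deliberately broad: never abort the ILE job
        warnings.warn("breadcrumb %s unusable (%s); falling back to prior draws"
                      % (path, exc))
        return None
```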
…ist (task oshaughn#23 complete)

Wide-ILE seed consumption on OSG (util_RIFT_pseudo_pipe.py): when --use-osg-file-transfer,
reference the proposal breadcrumb by BASENAME (cal_consolidated_$(macroiterationprev).npz),
add it to the ILE transfer list, and create a placeholder cal_consolidated_-1.npz so
condor's transfer for the first iteration (prev=-1, never produced) succeeds -- ILE's
breadcrumb load is try/except and falls back to the prior for the placeholder.  So
wide_{N+1} now actually consumes the learned proposal on OSG instead of always falling back to the prior.

Clean CALPILOT transfer list: write_ILE_sub_simple mutates transfer_file_names in place
(appends frames_dir, ile_pre.sh, the grid), and CALPILOT is built after the wide ILE, so
it had inherited that pollution (frames_dir x3, the wide grid, the ILE prescript).  Snapshot
a clean PSD+cal-envelope transfer list BEFORE those mutations and pass it to the pilot.
Verified: the CALPILOT transfer_input_files now lists each file exactly once (PSD, cal
envelopes, composite, args_ile.txt, frames_dir, calpilot_pre.sh, prev breadcrumb).

This completes the OSG pilot file transfer (CALPILOT runs in-container + transfers I/O;
wide_{N+1} gets the seed).  UNTESTED off-CIT -- validate on a real OSG run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
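The transfer-list pollution is ordinary Python list aliasing. A writer that appends to the list it was given changes the caller's list, so a later consumer sees the extra entries. Copying before the mutation avoids this. The file names and `append_ile_inputs` below are hypothetical stand-ins for `write_ILE_sub_simple`'s in-place behavior.

```python
def append_ile_inputs(transfer_files, frames_dir, prescript, grid):
    """Stand-in for a sub-file writer that mutates its transfer list in place."""
    transfer_files.extend([frames_dir, prescript, grid])


def dedupe_preserving_order(items):
    """Drop repeats while keeping first-seen order."""
    return list(dict.fromkeys(items))


shared = ["psd_H1.xml.gz", "cal_H1.txt"]
pilot_inputs = list(shared)      # snapshot BEFORE the ILE writer mutates `shared`
append_ile_inputs(shared, "frames_dir", "ile_pre.sh", "wide_grid.xml.gz")
pilot_inputs += ["composite.dat", "args_ile.txt", "frames_dir", "calpilot_pre.sh"]
```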
… rundir_pp_run

pp-run-build starts with `rm -rf $(PP_RUN_REAL)`, and pp-run-pilot reused rundir_pp_run --
so launching the pilot demo DESTROYED an in-progress vanilla pp-run.  pp-run-pilot[-build]
now overrides PP_RUN_REAL=rundir_pp_pilot so the two run directories are independent and
neither clobbers the other.  clean removes both.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…n the OSG prescript too

On OSG the CALPILOT.sub executable is calpilot_pre.sh (the prescript that rebuilds
local.cache then runs the container's util_CalPilotStage.py), so the stage name is in the
prescript, not CALPILOT.sub.  The build-validate grep now checks BOTH CALPILOT.sub and
calpilot_pre.sh (grep -qs), fixing a spurious "CALPILOT.sub does not run
util_CalPilotStage.py" on OSG.  Pipeline-writer/demo-level only -- no container rebuild.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…+ PoC

The decade-old "save the extrinsic distribution to inform the next iteration" goal,
generalized from the cal pilot's breadcrumb.  GMM-first (mcsamplerEnsemble is already
seedable via gmm_dict).

- breadcrumbs.py (schema v2): the extrinsic slot now carries a per-param-group Gaussian
  mixture -- means/covariances/weights/bounds + the param NAMES (so dim-group indices
  reconstruct against the next run's params_ordered).  cal + extrinsic coexist in one
  breadcrumb.  save/load round-trip test (cal + extrinsic) PASSES.
- RIFT/calmarg/extrinsic_handoff.py:
    fit_extrinsic_proposal(samples, log_weights, groups, bounds, n_comp) -- per group, fits
      with RIFT's OWN gaussian_mixture_model.gmm (the exact fitter the sampler uses in
      update_sampling_prior), so stored means/covs are in the model's internal frame and
      restore byte-identical -- no coordinate guesswork, no sklearn.
    gmm_dict_from_breadcrumb(extrinsic, params_ordered) -- reconstructs gmm objects keyed by
      dim-group indices (looked up by name), ready to seed mcsamplerEnsemble's gmm_dict.
  Standard groups (ra,dec),(distance,incl),(phi_orb,psi).  Handles the GMM running on cupy.
- PoC (__main__): synthetic BIMODAL sky posterior -> fit -> breadcrumb -> load -> seed ->
  the seeded sky GMM recovers BOTH modes.  PASS.

Worktree branch rift_O4d_junior_extrinsic_handoff (off the calmarg branch); does not touch
the running pipeline checkout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
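Storing parameter NAMES rather than raw indices is what allows the next run to rebuild its dim-group keys. The sketch below resolves the standard groups against a hypothetical `params_ordered`. `resolve_groups` is illustrative, not the `gmm_dict_from_breadcrumb` implementation.

```python
STANDARD_GROUPS = [("ra", "dec"), ("distance", "incl"), ("phi_orb", "psi")]


def resolve_groups(group_names, params_ordered):
    """Map each stored group of parameter names to dim indices in the
    current run's params_ordered, so a reordering of parameters is harmless."""
    return [tuple(params_ordered.index(name) for name in group)
            for group in group_names]
```

With a hypothetical ordering ["psi", "phi_orb", "incl", "distance", "ra", "dec"], the groups resolve to (4, 5), (3, 2), and (1, 0).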
…cept)

Complete the GMM extrinsic-handoff loop:

- ILE (integrate_likelihood_extrinsic_batchmode, EXECUTE-POINT -- needs container
  rebuild): --extrinsic-proposal-output harvests the run's extrinsic posterior
  samples + importance weights from sampler._rvs after integrate (same weight recipe
  as the distance-grid export, incl. the GMM sampler's raw-integrand storage), fits
  per-group GMMs via RIFT.calmarg.extrinsic_handoff, and writes a breadcrumb.
  The --extrinsic-proposal-breadcrumb seed side (pre-filling gmm_dict) was added in an earlier commit.
  Both wrapped in try/except so the handoff can never break a production integration.

- DESIGN_extrinsic_handoff.md: the decade-old "carry the extrinsic posterior between
  iterations" goal, GMM-first rationale (mcsamplerEnsemble.gmm_dict is trivially
  seedable), module/ILE pieces, PoC result, pilot-DAG plug-in plan, and the AV
  partial-reset limitation (task oshaughn#30: AV resets every integrate(), can only contract).

PoC (python -m RIFT.calmarg.extrinsic_handoff) and breadcrumb round-trip
(python -m RIFT.calmarg.breadcrumbs) both PASS.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make the extrinsic handoff usable end-to-end (standalone; does NOT require the cal
pilot), gated by --extrinsic-handoff and requiring the GMM sampler:

- util_RIFT_pseudo_pipe.py --extrinsic-handoff: thread per-event
  --extrinsic-proposal-output extr_proposal_$(macroiteration)_$(macroevent).npz and the
  seed --extrinsic-proposal-breadcrumb .../extr_consolidated_$(macroiterationprev).npz
  into args_ile.txt (OSG: basename + transfer-list + iteration-0 placeholder; shared FS:
  absolute path), mirroring the cal breadcrumb.  Warns if --ile-sampler-method != GMM.
  Passes --extrinsic-handoff[-select] through to the pipeline builder.

- util_ExtrinsicConsolidate.py (NEW): pick the single most representative per-event
  proposal (default by lnL = nearest the peak; neff/n_samples also available) ->
  extr_consolidated_<it>.npz.  Skips unreadable/placeholder inputs; ALWAYS writes output
  (empty if nothing valid) so the next iteration's seed/transfer never fails.

- dag_utils_generic.write_extrconsolidate_sub (NEW): the consolidation job, LOCAL universe
  on the submit node (pure-python file selection, no GPU/ILE/container/frames).  On OSG the
  per-event ILE outputs are transferred back to <wd>/iteration_<it>_ile, so it reads them
  from the shared FS -- no per-event input transfer (which condor cannot glob).

- create_event_parameter_pipeline_BasicIteration: one consolidation node per iteration,
  gated behind that iteration's unify (ILE barrier), and the next iteration's wide ILE jobs
  depend on it:  unify_{it} -> EXTRCONSOLIDATE_{it} -> wide ILE_{it+1}.

- ILE save side: record true lnL + neff in the proposal breadcrumb meta so consolidation
  can pick the most representative point.

- demo/rift/calmarg: `make extr-build` builds + offline-validates the whole thread
  (args_ile.txt flags, EXTRCONSOLIDATE.sub, unify->consolidate->next-ILE DAG edges);
  separate rundir_pp_extr so it never touches other run dirs.

Verified: `make extr-build` passes; util_ExtrinsicConsolidate standalone tests pass
(picks highest-lnL, skips placeholders, writes empty on no-input).

NOTE: the ILE binary change (--extrinsic-proposal-output/-breadcrumb, save+seed) is
EXECUTE-POINT -- rebuild the container before an OSG/CIT run.  The convergence subdag
(--first-iteration-jumpstart) does not yet carry --extrinsic-handoff (same as --calmarg-pilot).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rget

Found by running the GMM extrinsic handoff on a real GPU (cardassia, NVS 510); the full
loop now works end-to-end (iteration-0 writes proposal -> consolidate -> iteration-1 prints
"Extrinsic GMM SEEDED ... [(4,5),(3,2),(0,1)]" for all three groups -> integrates -> writes
the next proposal):

- reconstruct_gmm: move self.bounds onto the GPU (identity_convert_togpu).  The sampler's
  score()/_normalize write into a cupy array, so a leftover numpy bounds raised
  "non-scalar numpy.ndarray cannot be used for fill".
- gmm_dict_from_breadcrumb(existing_keys=...): match each breadcrumb group to the sampler's
  actual gmm_dict key by dim-SET and permute the stored means/covariances/bounds columns
  into that key's order.  Fixes the phase/pol group being silently dropped because the
  sampler pairs (psi,phi_orb)=(0,1) while the breadcrumb stored (phi_orb,psi)=(1,0).
- reconstruct_gmm(cov_inflate=2.0): broaden the seed (a warm start should be conservative;
  the ensemble sampler can contract but starves if seeded too tight).  Mitigates -- does not
  rescue -- a degenerate source: on a bad batch the sampler _reset()s gmm_dict[k]=None,
  i.e. discards the seed and continues cold (correct safety net).  So seed quality tracks the
  SOURCE iteration's convergence; a useful (accelerating) seed needs n_eff in the hundreds,
  i.e. a real --n-max / larger GPU, not the tiny smoke (n_eff~1 -> seed safely discarded).

- demo/rift/calmarg: `make extr-run[-build]` -- tiny GMM extrinsic-handoff pipeline on the CI
  data (300 initial / 200 per-gen intrinsic, 50 evals/ILE job, n-chunk 4000, n-max bounded to
  40000 vs the 4,000,000 production default, >=2 iterations).  Derives a run-specific ini
  (sed) because [rift-pseudo-pipe] ini values override the CLI.  Separate rundir_pp_extr_run.
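The dim-set matching in gmm_dict_from_breadcrumb(existing_keys=...) reduces to finding the sampler key with the same set of dimensions and reindexing. A minimal pure-Python sketch for a single Gaussian component (the real breadcrumb stores full mixtures; `match_and_permute` and its argument layout are hypothetical):

```python
def match_and_permute(group_dims, mean, cov, bounds, existing_keys):
    """Map one stored GMM group onto the sampler key with the same dimension set.

    group_dims: dims in the order the breadcrumb stored them, e.g. (1, 0).
    existing_keys: the sampler's gmm_dict keys, e.g. [(4, 5), (3, 2), (0, 1)].
    mean: length-d list; cov: d x d nested list; bounds: list of [lo, hi].
    Returns (key, mean, cov, bounds) reordered to the key, or None if no key
    has the same dimension set.
    """
    for key in existing_keys:
        if len(key) == len(group_dims) and set(key) == set(group_dims):
            # perm[i]: position, in the stored order, of the i-th dim of `key`
            perm = [group_dims.index(d) for d in key]
            new_mean = [mean[p] for p in perm]
            new_cov = [[cov[p][q] for q in perm] for p in perm]
            new_bounds = [bounds[p] for p in perm]
            return key, new_mean, new_cov, new_bounds
    return None
```

For the phase/pol case, a group stored as (1, 0) is matched to the sampler key (0, 1), and its mean entries, covariance rows and columns, and bounds are swapped; looked up under (1, 0) directly, it would not be found and would be silently dropped.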

DESIGN_extrinsic_handoff.md: documents the GPU validation, the two bugs, and the
seed-quality-vs-source-convergence finding.

ILE binary change is EXECUTE-POINT (container rebuild for OSG/CIT).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Background
----------
util_ConstructIntrinsicPosterior_GenericCoordinates.py has long used
three CLI flags for declaring how a parameter is treated:

  --parameter X         both fit dim AND MC sampling dim
  --parameter-implied X fit dim only (the converter produces X from the
                        data file's columns; the MC integrator never
                        sees it)
  --parameter-nofit X   MC sampling dim only (the integrator integrates
                        over it; the fit never sees it)

util_ConstructEOSPosterior.py declared the same three flags but never
honoured them: the integrator at line 487 hardcoded
`low_level_coord_names=dat_orig_names` in its convert_coords closure,
which only worked when the sampling basis equalled the data-file basis;
sampler.add_parameter iterated over coord_names (the fit basis); the
arity dispatch for likelihood_function keyed on len(coord_names); and
sampler.integrate was passed *coord_names rather than
*low_level_coord_names. The net effect: any user who tried to fit in a
transformed basis (e.g. via the new --supplementary-coordinate-code
plugin) silently got a wrong likelihood evaluation -- the rotation was
applied an extra time inside convert_coords every Monte Carlo step.

What this commit changes
------------------------

bin/util_ConstructEOSPosterior.py
  * Parameter-resolution block rewritten to mirror IntrinsicPosterior's
    semantics, plus a clean fallback to dat_orig_names when none of the
    three flags are supplied (legacy bare-invocation unchanged). Seven
    CLI permutations now map to documented (coord_names,
    low_level_coord_names) pairs.
  * The convert_coords closure used by the integrator captures
    low_level_coord_names as its input basis (was dat_orig_names). The
    initial dat->X conversion still uses dat_orig_names, since that's
    the basis of the file columns.
  * Sampler add_parameter loop now iterates over low_level_coord_names
    (the MC basis), and sampler.integrate is passed *low_level_coord_names.
  * The arity-dispatched likelihood_function definitions key on
    len(low_level_coord_names) and route every input -- including the
    scalar branches -- through convert_coords so a non-trivial converter
    is never silently bypassed.
  * Output-writer iterates samples by low_level_coord_names (the keys
    sampler._rvs actually carries) and applies the "constant fill"
    check in the sampling basis, not the fit basis. Implied (fit-only)
    coords correctly skip the output file.
  * Added a guard: if low_level_coord_names != coord_names but no
    coordinate plugin is supplied, raise a clear error instead of
    silently feeding samples through an identity convert_coords into a
    fit built in a different basis.
  * Help text for --parameter / --parameter-implied / --parameter-nofit
    rewritten to describe what each flag actually does now.
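The flag semantics can be summarized as a small resolution function. This is a hedged sketch: `resolve_bases` is hypothetical, it takes the already-parsed flag lists, and it does not reproduce every detail of the seven documented permutations. It shows the fit basis (parameter + implied), the sampling basis (parameter + nofit), the legacy fallback, and the no-plugin guard.

```python
def resolve_bases(parameter, implied, nofit, dat_orig_names, have_plugin):
    """Return (coord_names, low_level_coord_names) = (fit basis, MC basis).

    --parameter X        -> in both bases
    --parameter-implied  -> fit basis only
    --parameter-nofit    -> MC sampling basis only
    no flags at all      -> both bases fall back to the data-file columns
    """
    parameter, implied, nofit = list(parameter or []), list(implied or []), list(nofit or [])
    if not (parameter or implied or nofit):
        return list(dat_orig_names), list(dat_orig_names)  # legacy bare invocation
    coord_names = parameter + implied          # what the fit sees
    low_level_coord_names = parameter + nofit  # what the integrator samples
    if low_level_coord_names != coord_names and not have_plugin:
        # An identity convert_coords would feed samples into a fit built in
        # a different basis; refuse instead of silently doing that.
        raise ValueError(
            "fit basis %s differs from sampling basis %s but no coordinate "
            "plugin was supplied" % (coord_names, low_level_coord_names))
    return coord_names, low_level_coord_names
```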

RIFT/hyperpipe/coords.py
  * HyperCoordSpec.from_strings accepts integration ranges for names in
    coords-nofit (the MC sampling basis is coords-fit + coords-nofit);
    unknown range names are still rejected.
  * HyperCoordSpec.validate accepts empty coords-fit so long as
    coords-implied covers the fit basis and coords-nofit covers the
    sampling basis; emits distinct errors for empty-fit vs empty-sample.
  * to_parameter_args emits --integration-parameter-range for the
    sampling basis (parameters + nofit), not just parameters.
  * to_puff_args and to_test_args emit --parameter for the sampling
    basis -- the puff lane and convergence-test driver operate on the
    data-file columns, which is the sampling basis after decoupling.

RIFT/hyperpipe/config.py
  * validate_config accepts empty coords-fit when coords-implied
    (fit-side) and coords-nofit (sample-side) are non-empty.
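The relaxed hyperpipe validation reduces to two derived bases and three checks. A sketch assuming plain lists for the three coordinate groups and a name -> (lo, hi) dict for the ranges; `validate_spec` is illustrative, not the HyperCoordSpec API.

```python
def validate_spec(coords_fit, coords_implied, coords_nofit, ranges):
    """Return (fit_basis, sample_basis) or raise ValueError on an unusable spec."""
    fit_basis = list(coords_fit) + list(coords_implied)    # what the EOS fit sees
    sample_basis = list(coords_fit) + list(coords_nofit)   # what MC / puff / test see
    if not fit_basis:
        raise ValueError("empty fit basis: give coords-fit or coords-implied")
    if not sample_basis:
        raise ValueError("empty sampling basis: give coords-fit or coords-nofit")
    # Ranges are allowed for any sampling-basis name (now including nofit names);
    # anything else is still rejected.
    unknown = set(ranges) - set(sample_basis)
    if unknown:
        raise ValueError("integration ranges for unknown names: %s" % sorted(unknown))
    return fit_basis, sample_basis
```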

demo/hyperpipe/hyperpipe_conf_linear_uvw.yaml
  * Rewritten to actually exercise the decoupled path: coords-implied
    "u v w" (fit), coords-nofit "x y z" (sample), coords-sample ranges
    in (x, y, z), coord-module pointing at the linear plugin with the
    uvw_rotated chart. Iteration / puff / marg stay in (x, y, z); the
    EOS posterior fits in (u, v, w) and writes its posterior in
    (x, y, z).

Verified
--------
  * Parameter-resolution unit test (in this commit's worktree) covers 7
    CLI permutations -- legacy no-flags, legacy --parameter, IntrinsicPosterior
    --parameter+implied and --parameter+nofit, the new --implied-only,
    --implied+nofit, and full --parameter+implied+nofit -- all map to
    the documented (coord_names, low_level_coord_names) pairs.
  * HyperCoordSpec unit test covers the new decoupled emit (post sees
    implied/nofit and ranges; puff/test see the sampling basis only),
    a legacy-regression case (unchanged output), the two new validation
    errors (empty fit, empty sample), the new "range for nofit name"
    permission, and the still-rejected "unknown range name" case.
  * AST + yaml parses on every edited file.
  * validate_config passes on hyperpipe_conf_linear_uvw.yaml plus the
    demo's baseline and tracer yamls.
… GPU

Attempting the seed-acceleration demo on the CI point (SNR~17.5, lnLmax~90-115) showed the
ensemble (GMM) sampler does not converge there: n_eff pinned at ~1 through ~200k samples,
with OR without calmarg (vanilla GMM: 1.00007 at 196k / 50 iterations).  GMM collapses onto
the dominant sample at a sharp high-SNR peak; AV (the production sampler) handles these but is
not seedable.  So the GMM->GMM handoff is correct+safe but cannot bootstrap a useful seed on
real high-SNR data -- its payoff is gated on seedable/partial-reset AV (task oshaughn#30/oshaughn#25) or a
cross-sampler AV->GMM seed (fit_extrinsic_proposal already accepts any sampler's samples).
Recorded in DESIGN_extrinsic_handoff.md.
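Why n_eff sits at ~1: a weight-based effective sample size collapses to 1 once a single sample carries almost all the weight, which is what a sharp high-SNR peak does to an unconverged proposal. Two common definitions are sketched below. Which one RIFT's samplers report is not stated in this commit, so treat both as illustrations.

```python
import math


def _log_sum_exp(logw):
    m = max(logw)
    return m + math.log(sum(math.exp(x - m) for x in logw))


def neff_sum_over_max(logw):
    """sum(w) / max(w): equals 1 when a single sample carries all the weight."""
    return math.exp(_log_sum_exp(logw) - max(logw))


def neff_kish(logw):
    """Kish ESS (sum w)^2 / sum(w^2): also 1 when one sample dominates."""
    return math.exp(2.0 * _log_sum_exp(logw) - _log_sum_exp([2.0 * x for x in logw]))
```

For example, illustrative log-weights [115, 95, 90, 80] give n_eff within 1e-8 of 1 under both definitions, while ten equal weights give 10.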

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sal-adapt) + cross-sampler findings

- --extrinsic-proposal-adapt (default OFF = freeze): the seeded GMM groups are no longer
  re-fit each iteration.  Re-fitting a seed on a bad first batch dies in the GMM init
  (random.choice "probabilities are not non-negative") and triggers _reset, discarding the
  seed.  _train already skips groups with gmm_adapt=False, so freezing preserves the seed.
  Freezing is also the right semantics for a handed-off / cross-sampler proposal.  With
  freeze the seeded run completes with 0 resets and n_eff rises from cold ~1 to ~5-10.

- DESIGN_extrinsic_handoff.md: document the cross-sampler AV->GMM result.  AV converges as a
  source (n_eff~7 at 400k, lnLmax~143); the frozen seed lands cleanly and lifts n_eff, but the
  seeded GMM INTEGRAL is wrong (sqrt(2 lnLmax)=nan, Z~1e-4 vs cold ~1e43) -- the proposal is
  importance-sampling a displaced region.  Two suspects to audit (no more blind GPU): AV-vs-GMM
  _rvs coordinate convention (angle vs cosine for incl/dec), and cov_inflate pushing distance
  out of [1,1000] into NaN likelihood.  Same-sampler GMM->GMM round-trips cleanly.
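Freezing amounts to skipping the refit for seeded groups. A toy sketch of that control flow: the class and `_refit` are hypothetical, and only the `gmm_adapt` flag and the skip-in-`_train` behaviour come from the commit.

```python
class GroupedSampler:
    """Toy grouped-GMM sampler (hypothetical; not mcsamplerEnsemble)."""

    def __init__(self):
        self.gmm_dict = {}    # group key -> fitted model (any object)
        self.gmm_adapt = {}   # group key -> bool; False means frozen

    def seed(self, key, model, adapt=False):
        # Default adapt=False mirrors --extrinsic-proposal-adapt being OFF.
        self.gmm_dict[key] = model
        self.gmm_adapt[key] = adapt

    def _refit(self, key, batch):
        # Placeholder for the real weighted refit of one group.
        return ("refit", key, len(batch))

    def _train(self, batch):
        for key in list(self.gmm_dict):
            if not self.gmm_adapt.get(key, True):
                continue  # frozen seed: skipped, so a bad batch never touches it
            self.gmm_dict[key] = self._refit(key, batch)
```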

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d weights, ESS n_comp, distmarg)

Debugging the wrong integral seen in the GPU run (the seeded GMM integrated as if lnL~44 vs
the true ~140) found and fixed four real issues; the cross-sampler seed is now numerically
correct (finite lnLmax, valid Z) end-to-end:

1. SAVE side used the sampler's stored log_weights, but mcsamplerGPU/AV stores
   log_weights = tempering_exp*lnL + ln(prior) - ln(s_prior) (adapt-weight-exponent baked in).
   Fitting the GMM to those flattened weights displaces the proposal.  Now build the TRUE
   untempered weight from log_integrand + log_joint_prior - log_joint_s_prior and prefer the
   raw components over 'log_weights'.  (GMM's own _rvs is untempered -> GMM->GMM unaffected.)
   This took the seeded n_eff from ~5 to ~26.
2. cov_inflate default 2.0 -> 1.0: a frozen seed should match the source, not be widened
   (inflation pushes samples past hard bounds -> NaN likelihood).
3. fit_extrinsic_proposal: cap mixture components by the weight ESS (k <= ESS/(d+2)) and DROP
   any non-finite component (renormalize; skip group if none survive).  A starved source
   collapses a component to a singular/NaN covariance that poisons the whole seed.
4. The persistent nan lnLmax was distance sampled against [1,1000]: a seeded distance Gaussian
   spills past the bound -> NaN.  The calmarg path is meant to run with --distance-marginalization
   (the fused kernel IS a distmarg kernel); with distmarg on, the seeded integral is finite and
   valid.  (Gap: pseudo_pipe/extr-run don't add --distance-marginalization yet -- noted in doc.)
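Fixes 1 and 3 are small enough to sketch directly. A pure-Python sketch under stated assumptions: the "weight ESS" is taken to be the Kish estimate, the floor of one component is an assumption, and none of these function names exist in RIFT.

```python
import math


def untempered_log_weight(log_integrand, log_joint_prior, log_joint_s_prior):
    """True importance weight: ln w = lnL + ln p - ln q (no tempering exponent)."""
    return log_integrand + log_joint_prior - log_joint_s_prior


def tempered_log_weight(log_integrand, log_joint_prior, log_joint_s_prior, tempering_exp):
    """What the sampler stores for adaptation: tempering_exp*lnL + ln p - ln q."""
    return tempering_exp * log_integrand + log_joint_prior - log_joint_s_prior


def kish_ess(logw):
    m = max(logw)
    w = [math.exp(x - m) for x in logw]
    return sum(w) ** 2 / sum(x * x for x in w)


def max_components(logw, dim, requested):
    """Cap k <= ESS/(d+2); keeping at least one component is an assumption."""
    return max(1, min(requested, int(kish_ess(logw) // (dim + 2))))


def drop_nonfinite(weights, means, covs):
    """Drop components with any non-finite mean/cov entry and renormalize.

    Returns None when nothing survives (the caller then skips the group).
    """
    keep = [i for i in range(len(weights))
            if all(math.isfinite(v) for v in means[i])
            and all(math.isfinite(v) for row in covs[i] for v in row)]
    if not keep:
        return None
    total = sum(weights[i] for i in keep)
    return ([weights[i] / total for i in keep],
            [means[i] for i in keep],
            [covs[i] for i in keep])
```

Fitting to `tempered_log_weight` with tempering_exp < 1 flattens the peak, which is why the fit displaced the proposal; the untempered form restores the true posterior weights.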

Result (distmarg on, CI point SNR~17.5): seeded GMM has 0 resets, finite lnLmax, valid Z, but
n_eff ~1 == cold n_eff ~1.  The handoff is correct+safe but does NOT accelerate here because
GMM does not converge on this peak (cold or seeded) and the AV source (n_eff~5) is too
under-converged to inform a strong seed.  Hard evidence the payoff needs seedable AV (task oshaughn#30)
or a converged source.  Full analysis in DESIGN_extrinsic_handoff.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nstance (XML compat)

copy_lsctables_sim_inspiral iterated lsctables.SimInspiralTable.validcolumns.keys() (the full
schema) and did bare getattr(row, simattr) for string columns (waveform/source/numrel_data/
taper) and numeric columns.  On the current igwn_ligolw + lalsuite stack, ILE-written
sim_inspiral tables contain only the columns actually set, so the schema view and the written
columns drift apart -> reading a saved ILE output_*.xml.gz raised
"AttributeError: 'SimInspiral' object has no attribute 'waveform'"
(and would equally fail on any absent numeric column via the else branch).

Fix: skip columns not present on the row instance (hasattr guard), after the
process_id/simulation_id default-setting branch (which doesn't read the row).  RIFT's own
grids (written via lsctables.New with all columns) are unaffected; column-subset tables now
round-trip.  Verified both a 300-row RIFT grid and a waveform-less ILE-style table read with
no AttributeError.
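The guard can be shown with a stand-in row object. A sketch only: `Row` and `copy_present_columns` are hypothetical, the real function copies onto a new sim_inspiral row rather than into a dict, and the ID default of 0 is a placeholder.

```python
class Row:
    """Stand-in for a ligolw row that only carries the columns actually written."""

    def __init__(self, **cols):
        for name, value in cols.items():
            setattr(self, name, value)


def copy_present_columns(row, schema_columns, id_defaults=("process_id", "simulation_id")):
    """Copy schema columns from `row` into a dict, tolerating a column subset."""
    out = {}
    for name in schema_columns:
        if name in id_defaults:
            out[name] = 0  # placeholder default; this branch never reads the row
            continue
        if not hasattr(row, name):
            continue  # column absent from this table: skip instead of AttributeError
        out[name] = getattr(row, name)
    return out
```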

Per /home/oshaughn/BREADCRUMB_rift_xml_compat.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… for the calmarg pipeline

There was no pipeline-code gap: --internal-marginalize-distance already composes cleanly with
--calmarg-fused-kernel (verified -- args_ile gets --distance-marginalization + a util_InitMargTable
lookup table AND --calibration-fused-kernel), and the fused kernel does NOT require distmarg (it
has both Q_fused_calmarg_cupy and Q_fused_calmarg_distmarg_cupy; the ILE binary wires whichever
applies).  The only gap was that the demo targets didn't expose distmarg.

- demo Makefile: PP_DMARG toggle (default 0, optional) -> --internal-marginalize-distance
  --internal-distance-max PP_DMAX, threaded into extr-build, extr-run-build, pp-run-build.
  (Distinct from the direct-ILE dag-build DMARG knob, which uses a pre-built lookup table.)
  extr-validate checks --distance-marginalization + lookup table when PP_DMARG=1.  Verified
  `make extr-build PP_DMARG=1` passes.
- DESIGN_extrinsic_handoff.md: corrected -- distmarg is OPTIONAL with the fused kernel, not
  required; RECOMMENDED with --extrinsic-handoff (removes distance + its hard bound from the
  seeded GMM proposal, which was the source of the boundary-NaN).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fairdraw cupy crash)

BREADCRUMB_export_cal_posterior.md: the final export with extrinsics did not carry the
recovered calibration posterior.  Now --calibration-export-posterior (ILE) /
--calmarg-export-posterior (pseudo_pipe): at the fairdraw export, for each fair-draw sample
draw ONE cal realization in proportion to its posterior weight (per-realization likelihood
components from return_cal_components, times the importance weight cal_log_weights) and write
a SELF-CONTAINED sibling <output>_<event>_cal.dat with the FULL draw -- intrinsic + extrinsic
+ the drawn realization's spline nodes as labeled cal_<IFO>_amp_<k>/cal_<IFO>_phase_<k>
columns.  (The fairdraw LIGOLW/.dat schema can't carry arbitrary columns, so per the user the
cal posterior rides a row-aligned sibling .dat with the whole draw, plottable as-is.)
  - node retention: the production prior path now keeps the cal node vectors
    (draw_prior_realizations_with_nodes) when the flag is set; the seed path already returns them.
  - verified on GPU: writes 1 sample x 90 cols incl 60 cal cols (amp_0..9 + phase_0..9 over H1,L1,V1).
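The per-sample draw is a categorical choice over realizations with log-probability lnL_j + cal_log_weights_j. A sketch under assumptions: the inputs are plain lists for one fair-draw sample, the helper names are hypothetical, and the per-IFO amp-then-phase column order is an assumption consistent with the labels above.

```python
import math
import random


def draw_cal_realization(per_real_lnL, cal_log_weights, rng):
    """Pick one calibration realization index for a single fair-draw sample.

    per_real_lnL[j]: log-likelihood of this sample under realization j.
    cal_log_weights[j]: log importance weight of realization j.
    The posterior weight of j is proportional to exp(lnL_j + logw_j).
    """
    logp = [a + b for a, b in zip(per_real_lnL, cal_log_weights)]
    m = max(logp)
    probs = [math.exp(x - m) for x in logp]  # shift by the max for stability
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def cal_columns(ifos, n_nodes):
    """Column labels for the drawn spline nodes, e.g. cal_H1_amp_0 ... cal_V1_phase_9."""
    cols = []
    for ifo in ifos:
        cols += ["cal_%s_amp_%d" % (ifo, k) for k in range(n_nodes)]
        cols += ["cal_%s_phase_%d" % (ifo, k) for k in range(n_nodes)]
    return cols
```

Each output row is then the fair-draw sample's intrinsic and extrinsic values followed by the chosen realization's node values, which is what makes the sibling .dat self-contained.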

Also fix a PRE-EXISTING crash this surfaced: mcsamplerEnsemble (GMM sampler) fairdraw on GPU
did `self.xpy.min([n_extr, 1.5*eff_samp, 1.5*neff])` -- cupy.min has no Python-list overload
("'list' object has no attribute 'min'"), so ANY GMM-sampler fairdraw export on GPU crashed
(independent of calmarg).  Use Python min() of floats.

ILE binary is EXECUTE-POINT (container rebuild to run on OSG/CIT).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…+ thread cal-export into demo

PILOT OSG bug: iteration 0 seeds from cal_consolidated_$(macroiterationprev).npz with
macroiterationprev=-1 -> cal_consolidated_-1.npz, the 0-byte placeholder pseudo_pipe creates
so condor's transfer_input_files of that path does not fail.  Locally the file simply does
not exist (the "missing -> fall back to PRIOR" check fires); on OSG it IS transferred in, so
it EXISTS but is empty, and np.load raised "EOFError: No data left in file", crashing the
first-iteration ILE.  Fix: treat a missing OR EMPTY breadcrumb as "not present yet" (size
guard before any load), for BOTH the calibration and extrinsic seed paths.  EXECUTE-POINT --
rebuild the container.
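The guard is a size check before any load. A minimal stdlib sketch; `breadcrumb_present` is a hypothetical helper name.

```python
import os


def breadcrumb_present(path):
    """True only if the seed breadcrumb exists AND is non-empty.

    A 0-byte placeholder transferred in by condor on iteration 0 is treated
    exactly like a missing file, so the caller falls back to the prior
    instead of calling np.load on an empty file.
    """
    return bool(path) and os.path.isfile(path) and os.path.getsize(path) > 0
```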

demo/rift/calmarg: PP_CALPOST toggle (default 1) threads --calmarg-export-posterior into
pp-run-build and extr-run-build, so the recovered cal posterior is written in the runnable
demos.  (Does NOT touch a running rundir_pp_run -- pp-run-build starts with its own rm -rf.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nders; works on old container)

Complement to the ILE empty-breadcrumb size-guard: make the iteration-0 placeholder a VALID
breadcrumb that LOADS cleanly, so the pipeline-writer fix ALONE (no container rebuild) keeps an
older ILE binary from crashing on it.

- generate_realizations.prior_cal_breadcrumb_dict(env_dir, dets, fmin, fmax, n_spline_points):
  build the 'cal' breadcrumb for the broad PRIOR with proposal == prior.  Seeding from it
  draws cal realizations from the prior with ZERO importance weights -- exactly equivalent to
  the cold prior draws.  Layout matches seed_realizations_from_breadcrumb (per-det
  [amp,phase] blocks; dim = 2N*len(dets)).
- util_RIFT_pseudo_pipe.py: on OSG file-transfer, write cal_consolidated_-1.npz as that valid
  prior breadcrumb (was a 0-byte file) and extr_consolidated_-1.npz as a valid EMPTY breadcrumb
  (extrinsic=None -> cold).  Falls back to a 0-byte file only if the build fails (then the ILE
  size-guard catches it).  PIPELINE-WRITER change -- no container rebuild needed.
- util_CalMakePriorBreadcrumb.py (NEW): (re)generate the prior placeholder for an ALREADY-built
  run dir IN PLACE (overwrite the 0-byte cal_consolidated_-1.npz), so an in-flight pilot run can
  be patched without re-running pseudo_pipe or rebuilding the container.
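Why a proposal == prior breadcrumb reproduces cold draws: the importance log-weight is ln p(x) - ln q(x), which is identically zero when q = p. A 1-D Gaussian sketch follows (the real breadcrumb is multivariate, so in practice the weights are zero only up to round-off, consistent with the ~1e-14 above), plus the per-detector [amp, phase] layout. Both helpers are hypothetical.

```python
import math


def gauss_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def seed_log_weights(samples, prior, proposal):
    """Importance log-weights ln p(x) - ln q(x); prior/proposal are (mu, sigma)."""
    return [gauss_logpdf(x, *prior) - gauss_logpdf(x, *proposal) for x in samples]


def breadcrumb_layout(dets, n_spline_points):
    """Per-detector [amp, phase] blocks; total dim = 2 * N * len(dets)."""
    layout = []
    for det in dets:
        layout += [(det, "amp", k) for k in range(n_spline_points)]
        layout += [(det, "phase", k) for k in range(n_spline_points)]
    return layout
```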

Verified: the placeholder loads + seeds with max|cal_log_weights| ~ 1e-14 (== prior draws).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lmarg flags

The demo grew from the single-ILE correctness check into a ladder up to a runnable condor
pipeline.  Document all targets grouped by what they exercise (A: numerical correctness +
single-ILE; B: direct-ILE DAG + tuning; C: offline pipeline build-validate incl. extrinsic
handoff; D: runnable pipeline on CI data + pilots + extrinsic-handoff GPU run), the runnable
toggles (OSG/PP_PILOT/PP_DMARG/PP_CALPOST/PP_NIT), the helper utils, and the advanced pipeline
flags (--calmarg-export-posterior, --internal-marginalize-distance, --calmarg-pilot,
--extrinsic-handoff).  Add the recovered-cal-posterior section, the iteration-0 prior
placeholder note (+ util_CalMakePriorBreadcrumb.py), and the execute-point vs pipeline-writer
rule.  Points to DESIGN_adaptive_driver.md / DESIGN_extrinsic_handoff.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds --supplementary-coordinate-{code,function,ini,chart} plus the two
input/output parameter list flags to plot_posterior_corner.py, mirroring
the surface already in util_ConstructEOSPosterior.py.  When the plugin
flag is set, _materialize_plugin_columns runs once per loaded posterior
and once per loaded composite file *after* the existing RIFT
postprocessing -- it computes the plugin's output columns from existing
record-array fields and splices them in via add_field.

Critically, the hook is strictly ADDITIVE.  Any output name already
present in samples.dtype.names is skipped, so the hardcoded
extract_combination_from_LI and the per-file postprocess loops
(mc / eta / chi_eff / LambdaTilde / chi1_perp / ...) always win.  Legacy
invocations with no --supplementary-coordinate-code flag are byte-
identical to the pre-plugin tool -- the helper returns input unchanged
when the converter is None.

CLI surface
-----------
  --supplementary-coordinate-code SPEC
        'rift_default' | filesystem path to a .py | dotted module name.
  --supplementary-coordinate-function NAME
        Entry-point callable. Defaults to 'convert_coordinates'.
  --supplementary-coordinate-ini PATH
        Optional; parsed and handed to prepare().
  --supplementary-coordinate-chart NAME
        Required only when the plugin defines multiple charts.
  --supplementary-coordinate-input-parameter NAME   (action='append')
        Override the plugin-declared INPUT_PARAMETERS / chart's
        input_parameters list.
  --supplementary-coordinate-output-parameter NAME  (action='append')
        Override the plugin-declared OUTPUT_PARAMETERS / chart's
        parameters list.

When the input / output name lists aren't given on the CLI they're
resolved from CHARTS[chart] (input_parameters, parameters) and then from
the module-level INPUT_PARAMETERS / OUTPUT_PARAMETERS attributes.
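The lookup order can be sketched as follows. The function is hypothetical; defaulting to the only chart when --supplementary-coordinate-chart is omitted is an assumption based on the "required only when the plugin defines multiple charts" note.

```python
def resolve_names(cli_names, module, chart, which):
    """Resolve plugin names: CLI, then CHARTS[chart], then module-level lists.

    which: "input" or "output".
    """
    if cli_names:
        return list(cli_names)
    charts = getattr(module, "CHARTS", None) or {}
    if chart is None and len(charts) == 1:
        chart = next(iter(charts))  # assumption: a single chart needs no --...-chart
    chart_key = "input_parameters" if which == "input" else "parameters"
    if chart in charts and chart_key in charts[chart]:
        return list(charts[chart][chart_key])
    attr = "INPUT_PARAMETERS" if which == "input" else "OUTPUT_PARAMETERS"
    return list(getattr(module, attr, []))
```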

Verified
--------
Five synthetic cases on an (m1, x, y, z) record array with the linear
plugin requesting (u, v, w):
  * happy path -- (x, y, z) -> (u, v, w) values match
    u=(x+y)/sqrt(2), v=(y-x)/sqrt(2), w=z on every row; (m1, x, y, z)
    untouched.
  * output-reorder -- works regardless of the order u/v/w are listed.
  * name-collision -- samples pre-seeded with u=99; the plugin leaves u
    alone (RIFT path wins) and still adds v, w.
  * missing-input -- samples without an x column; helper logs a skip,
    returns input unchanged, no crash.
  * no-plugin -- helper is identity (out is samples).

RePrimAnd's Python API changed: the NS-accuracy factory tov_acc_simple was
renamed star_acc_simple (now taking two leading bool flags need_deform,
need_bulk, then acc_tov, acc_deform, minsteps), and make_tov_branch_stable
replaced num_samp/mgrav_min with mg_cut_low_rel/mg_cut_low_abs/gm1_step. The old
calls raised TypeErrors against current installs. Add version-robust shims
(_pyr_star_acc, _pyr_tov_branch, _pyr_interval) that target the modern API and
fall back to the legacy one, so EOSReprimand/make_mr_lambda_reprimand work with
either pyreprimand. make_eos_barotr_spline and the star_branch accessors are
unchanged. Also fix the read_tov_sequence path (load_star_branch takes a
filename, not the eos object). Verified against the RePrimAnd 1.7 docs.

Co-Authored-By: Claude <noreply@anthropic.com>
… plot tool

Brings in three commits from the HyperpipeCoordinates worktree:

  8aaba0c  util_ConstructEOSPosterior: decouple fit basis from MC sampling basis
            (--parameter-implied / --parameter-nofit semantics ported from
            util_ConstructIntrinsicPosterior_GenericCoordinates.py, plus
            integrator, sampler, arity dispatch and output-writer rewires;
            hyperpipe yaml schema relaxed so coords-fit can be empty when
            coords-implied / coords-nofit carry the bases.)
  d1995aa  demo/hyperpipe: README tour of the four yaml configs
  5643386  plot_posterior_corner: additive coordinate-plugin hook
            (--supplementary-coordinate-* flags, never overrides a name the
            hardcoded RIFT path already produced.)

Safety check before merging:
  * merge-base with calmarg_in_loop = 26f8d83 (the calmarg breadcrumb).
  * Files touched by eospost_coords: util_ConstructEOSPosterior.py,
    plot_posterior_corner.py, RIFT/hyperpipe/{coords,config}.py,
    demo/hyperpipe/{README.md, hyperpipe_conf_linear_uvw.yaml}.
  * Files touched by calmarg_in_loop since the merge-base: all under
    demo/rift/calmarg/ and the calmarg-pilot-lane drivers.
  * Path overlap between the two = empty.
  * git merge-tree produces zero conflict markers.
  * Same path-disjoint result against origin/rift_O4d_junior_extrinsic_handoff
    (a sibling junior branch cross-checked at merge time).
…nm_backend' into rift_O4d_junior_calmarg_in_loop
…yreprimand 1.7)

pyreprimand's star_acc_simple is (*, need_deform, need_bulk, acc_tov, acc_deform,
minsteps) -- all keyword-only -- so the positional call in _pyr_star_acc raised
TypeError. Pass by keyword. Also add a defaults fallback for make_tov_branch_stable.

Co-Authored-By: Claude <noreply@anthropic.com>

Brings the puff lane into the same coordinate-plugin framework already
used by util_ConstructEOSPosterior and plot_posterior_corner.  When
--supplementary-coordinate-code is supplied, both
util_HyperparameterPuffball.py and util_HyperparameterTracerUpdate.py
operate in the PLUGIN basis: forward-transform the file's input-basis
columns into the basis named by --parameter, do the covariance estimation
/ SMC / birth-death / puff-displacement step in that basis, then
INVERSE-transform back to the file basis to write the output .dat in the
same column structure the rest of the pipeline expects.

The legacy code path is byte-identical when no plugin is supplied --
--parameter names are file columns, _extract_X reads them directly, the
write-back uses opts.parameter -> cols.index(name).  The plugin-or-not
branch is the same `if plugin_active` predicate in every site.

Required plugin contract addition: inverse_convert_coordinates(y_in,
coord_names, low_level_coord_names, **kwargs) -> (N, len(low_level_coord_names)).
The puff lane MUST round-trip through the plugin basis, so we bail out
loudly if the plugin doesn't define an inverse rather than silently
using a pseudo-inverse (which would give subtly-wrong placements).
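The load-time check and the forward/displace/inverse round trip can be sketched with any module-like object. The forward function's signature is assumed to mirror the inverse's (x_in, coord_names, low_level_coord_names); both helper names are hypothetical.

```python
def load_puff_plugin(module):
    """Reject plugins without an inverse at load time (no pseudo-inverse fallback)."""
    if not callable(getattr(module, "inverse_convert_coordinates", None)):
        raise ValueError("coordinate plugin must define inverse_convert_coordinates "
                         "for the puff lane")
    return module


def puff_in_plugin_basis(rows, module, fn_name, coord_names, low_level_coord_names, displace):
    """Forward-transform file columns, displace in the plugin basis, transform back."""
    forward = getattr(module, fn_name)
    y = forward(rows, coord_names, low_level_coord_names)
    y_new = displace(y)  # covariance / SMC / puff step happens in the plugin basis
    return module.inverse_convert_coordinates(y_new, coord_names, low_level_coord_names)
```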

CLI surface (both executables)
------------------------------
  --supplementary-coordinate-code SPEC
        'rift_default' | filesystem path to a .py | dotted module name.
  --supplementary-coordinate-function NAME
        Entry-point callable. Defaults to 'convert_coordinates'.
  --supplementary-coordinate-ini PATH
        Optional; parsed and handed to prepare().
  --supplementary-coordinate-chart NAME
        Required only when the plugin defines multiple charts.
  --supplementary-coordinate-input-parameter NAME (action='append')
        File-column name to feed the plugin as an input dimension.  If
        omitted, CHARTS[chart].input_parameters / INPUT_PARAMETERS is used.

linear_coordinate_convert.py: inverse_convert_coordinates
---------------------------------------------------------
Closed-form x = A^{-1} (y - b) with cached A^{-1}.  Requires a square,
non-singular A; raises if A is non-square (pseudo-inverse is ambiguous
for the puff use case) or if the input doesn't span every output
dimension declared in OUTPUT_PARAMETERS.  Honors permuted coord_names /
low_level_coord_names orders.
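A pure-Python sketch of the closed-form inverse with a cached A^{-1}, using Gauss-Jordan elimination with partial pivoting (an illustrative choice; the plugin's actual numerics are not shown here). Handling of permuted name orders is omitted, and the function names are hypothetical.

```python
def invert_square(A):
    """Gauss-Jordan inverse with partial pivoting; raises on non-square/singular A."""
    n = len(A)
    if any(len(row) != n for row in A):
        raise ValueError("A must be square (pseudo-inverse is ambiguous for the puff use case)")
    # Augment [A | I] and reduce the left half to the identity.
    M = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-14:
            raise ValueError("A is singular")
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def linear_forward(x, A, b):
    """y = A x + b for one row x."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]


def linear_inverse(y, A_inv, b):
    """x = A^{-1} (y - b) for one row y, using the cached A^{-1}."""
    d = [yi - bi for yi, bi in zip(y, b)]
    return [sum(a * di for a, di in zip(row, d)) for row in A_inv]
```

With A the 45-degree rotation from the uvw_rotated test (u=(x+y)/sqrt(2), v=(y-x)/sqrt(2), w=z) and b = 0, linear_inverse(linear_forward(x, A, b), invert_square(A), b) recovers x to round-off.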

Verified
--------
Synthetic test on a 2000-point (u,v,w)-diagonal Gaussian rotated into
(x,y,z), driven through the tracer's puffball-mode regression path:

  * plugin puff_factor=0.5 yields per-axis (u,v,w) variance ratios of
    [1.24, 1.25, 1.25] -- the textbook (1 + puff_factor^2) growth.
  * uvw off-diagonal correlation stays < 0.04 (diag-cov data, diag-cov
    delta).
  * output .dat header is (lnL, sigma_lnL, x, y, z) -- the file's
    original basis is preserved across the round-trip.
  * puff_factor=0 leaves the grid unchanged modulo the tracer's existing
    near-singular-cov regularization (~1e-4 residual on xyz of stdev ~0.8).
  * Legacy --parameter x y z path runs and displaces the grid unchanged.
  * Missing input column -> clean error.
  * Plugin without inverse_convert_coordinates -> clean error at load.
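The 1.25 ratio follows from adding an independent displacement delta with Var(delta) = f^2 Var(x): Var(x + delta) = (1 + f^2) Var(x), so f = 0.5 gives 1.25. A 1-D sketch under that assumption; `puff` is a simplified, hypothetical stand-in for the tracer's puff step, which works with full covariances.

```python
import random
import statistics


def puff(samples, puff_factor, rng):
    """Displace each sample by an independent Gaussian with std = puff_factor * std(samples)."""
    sd = statistics.pstdev(samples)
    return [x + rng.gauss(0.0, puff_factor * sd) for x in samples]
```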

Also verified linear_coordinate_convert.inverse round-trips to 4.4e-16
against the forward transform.
…oshaughnessy-junior/research-projects-RIT into rift_O4d_junior_calmarg_in_loop

…ILE too

Follow-up to the puff-barrier fix (c4c1455): the puffball ILE jobs run with the SAME
args_ile as the normal wide ILE, so they read the same iteration-(it-1) seed breadcrumb
(--calibration-proposal-breadcrumb / --extrinsic-proposal-breadcrumb).  The normal-ILE seed
barriers are applied via ile_node_list_per_iteration BEFORE the puffball ILE jobs are created,
so the puffball jobs were missing them and would race the it-1 consolidation that produces the
seed file (silently falling back to the prior).  Now that c4c1455 added extra_parent_nodes,
thread last_puff_node AND (when pilots/handoff are active) calpilot_{it-1}/extrconsolidate_{it-1}
into the puffball ILE's extra_parent_nodes, so every wide ILE job of iteration `it` (normal +
puffball) waits for the same seed barrier.
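The barrier rule in toy form: every wide ILE node of iteration `it`, normal or puffball, gets the same seed-producing parents, and puffball nodes additionally wait on the puff node. Node names and helpers are hypothetical; in the real builder, puffball jobs receive these through extra_parent_nodes.

```python
def add_seed_barrier(parents, ile_nodes, barrier_nodes):
    """Make each node in ile_nodes depend on every non-None barrier node.

    parents: dict node -> set of parent nodes (a toy DAG).
    barrier_nodes: e.g. the iteration-(it-1) consolidation / pilot nodes,
    plus the puff node when wiring puffball ILE jobs.
    """
    for node in ile_nodes:
        parents.setdefault(node, set()).update(b for b in barrier_nodes if b is not None)
    return parents


def ready(parents, done):
    """Nodes (not yet done) whose parents have all finished."""
    return {n for n, ps in parents.items() if n not in done and ps <= done}
```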

Verified by building a puff+pilot DAG: the puff node's children are 200/200 ILE_puff jobs and
0 normal ILE jobs (no halt), and a puffball ILE job depends on both ParameterPuffball and
CalPilotStage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>