diff --git a/.felt/docker-uv-revert/docker-uv-revert.md b/.felt/docker-uv-revert/docker-uv-revert.md index 1a68f4c39..dd138a97f 100644 --- a/.felt/docker-uv-revert/docker-uv-revert.md +++ b/.felt/docker-uv-revert/docker-uv-revert.md @@ -6,11 +6,11 @@ tags: - docker - infra created-at: 2026-04-27T11:26:45.677512058+02:00 -outcome: 'PR #719 (chore: switch Dockerfile to slim Python + uv lockfile) opened and CI-green on first try (3m31s); ready for Martin''s review. Drops conda double-install, makes pyproject SSOT + uv.lock the pinned manifest, switches WeightWatcher from sed-patched source build to Debian''s pre-patched 1.12+dfsg-3 package, adds binary smoke tests to deploy-image.yml.' +outcome: 'PR #719 (chore: switch Dockerfile to slim Python + uv lockfile) opened and CI-green on first try (3m31s); ready for review. Drops conda double-install, makes pyproject SSOT + uv.lock the pinned manifest, switches WeightWatcher from sed-patched source build to Debian''s pre-patched 1.12+dfsg-3 package, adds binary smoke tests to deploy-image.yml.' decisions: base: label: Base image - rationale: Conda double-install was the actual problem; cleanest resolution is to drop conda entirely. Martin's canfar concern is satisfied as long as the slim image works on canfar. + rationale: Conda double-install was the actual problem; cleanest resolution is to drop conda entirely. The canfar deployment concern is satisfied as long as the slim image works on canfar. default: python-slim options: python-slim: @@ -50,7 +50,7 @@ decisions: label: uv + pyproject + uv.lock; uv sync --frozen in Dockerfile modernize: label: Modernize package versions - rationale: 'We determined which versions MUST stay pinned: only ngmix (Axel''s stable_version branch — replacement is tracked separately). Everything else can move to current latest because uv resolved cleanly and CI smoke test still passes (3m42s). 
If a real pipeline run on canfar surfaces a numpy-2 / pandas-3 break, the fix is a targeted constraint + uv lock, not a wholesale revert.' + rationale: 'We determined which versions MUST stay pinned: only ngmix (pinned to a stable_version fork branch — replacement is tracked separately). Everything else can move to current latest because uv resolved cleanly and CI smoke test still passes (3m42s). If a real pipeline run on canfar surfaces a numpy-2 / pandas-3 break, the fix is a targeted constraint + uv lock, not a wholesale revert.' default: stay-current options: stay-conservative: @@ -58,7 +58,7 @@ decisions: excluded: true excluded_reason: Drift between pyproject signal and lockfile reality; loses the chance to surface numpy-2/pandas-3 incompatibilities at PR time when CI is fast stay-current: - label: Bump pyproject minimums to current major versions (numpy 2, astropy 7, pandas 3, galsim 2.8, mpi4py 4.1, etc.); pin ngmix to Axel's stable_version branch + label: Bump pyproject minimums to current major versions (numpy 2, astropy 7, pandas 3, galsim 2.8, mpi4py 4.1, etc.); pin ngmix to its stable_version fork branch insights: ci-fast: claim: 'First CI run on PR #719 went green in 3m31s. uv installed 238 packages in 322ms — everything resolved to prebuilt wheels, no source compilation of galsim/mpi4py/python-pysap/etc. Massive speedup vs. previous build.' @@ -97,11 +97,10 @@ The `--frozen` flag is the discipline mechanism: a stale lockfile cannot ship. ## Followups - Watch CI on #719. The slim-base apt list is conjectural — galsim/mpi4py/python-pysap pull a lot of system deps and we may need to add more (`libatlas-base-dev`, `libblas-dev`, etc). -- If CI needs anything beyond what's in the apt block, that's the surface that benefits from a [[shapepipe/prs-in-flight]] note for next time. -- After this lands, [[shapepipe/prs-in-flight]] PRs #708 and #714 may need a small rebase. 
-- Optional: separate `Dockerfile.canfar` building on skaha if there's a concrete deployment reason. Currently conjectural — Martin floated it but we agreed slim should work on canfar. +- If CI needs anything beyond what's in the apt block, that's worth noting for next time. +- After this lands, PRs #708 and #714 may need a small rebase. +- Optional: separate `Dockerfile.canfar` building on skaha if there's a concrete deployment reason. Currently conjectural — floated as a possibility, but slim should work on canfar. ## Connections - [[shapepipe]] — root -- [[shapepipe/prs-in-flight]] — touches the testing-scaffold xfail set and the develop-bugs PR diff --git a/.felt/fabian-coord-bug/fabian-coord-bug.md b/.felt/fabian-coord-bug/fabian-coord-bug.md deleted file mode 100644 index 66213d20c..000000000 --- a/.felt/fabian-coord-bug/fabian-coord-bug.md +++ /dev/null @@ -1,10 +0,0 @@ ---- -name: Fabian's coord-propagation bug + image-sim code on github -tags: - - shapepipe - - bug - - collaboration - - future -created-at: 2026-04-27T11:26:52.878118978+02:00 -outcome: 'Fabian: 1-line fix in shapepipe needs porting; first need him to put image-sim code/configs on github so it''s testable. Beg if necessary.' ---- diff --git a/.felt/ngmix-update/ngmix-update.md b/.felt/ngmix-update/ngmix-update.md index 2df017deb..723871e65 100644 --- a/.felt/ngmix-update/ngmix-update.md +++ b/.felt/ngmix-update/ngmix-update.md @@ -1,9 +1,9 @@ --- -name: ngmix library upgrade + Lucy wrapper sync +name: ngmix library upgrade + wrapper sync tags: - shapepipe - ngmix - future created-at: 2026-04-27T11:26:51.026191639+02:00 -outcome: 'Future: replace Axel''s stable_version fork with upstream ngmix; reconcile with Lucy''s cleaned-up wrapper from her visit' +outcome: 'Replace the pinned ngmix fork (a stable_version branch carrying not-yet-upstreamed fixes) with upstream ngmix once those land; reconcile the wrapper afterward.' 
--- diff --git a/.felt/prs-in-flight/prs-in-flight.md b/.felt/prs-in-flight/prs-in-flight.md deleted file mode 100644 index ff110eb0e..000000000 --- a/.felt/prs-in-flight/prs-in-flight.md +++ /dev/null @@ -1,76 +0,0 @@ ---- -name: PRs in flight after v2 merge -tags: - - shapepipe - - pr -created-at: 2026-04-27T11:26:49.300097608+02:00 -outcome: 'Post-v2 + post-propagation: infra stream now landed (#718 setuptools, #719 uv-lockfile, #728 dependabot+SHA-pin), supply-chain hygiene done (20 → 0 alerts). Issue #712 empirically verified resolved against current `:develop` (all 11 packages in Martin''s May 18 list import in both read-only and writable sandbox modes); comment posted, awaiting Martin reply before closing. Science PRs still open: #714 develop-bugs (closes #709 + #711 only — #712 closes separately), #708 testing-scaffold (mine); #725 centroid shift (Axel), several older Martin PRs (#704 #703 #699 #660 #650 #636), #670 lbaumo file_io. Next thread: merge #714.' -insights: - 714-already-redundant: - claim: 'Surprise from rebasing #714: its Dockerfile commit (cf304f8f, adding astroquery/numba/fitsio + setuptools<81 pin) was *already* redundant on current develop — the v2 merge silently put astroquery/numba/fitsio into pyproject and the v2 Dockerfile installs them via ''pip install -e ".[fitsio]"'' at the end. setuptools<81 went away via #718. So ''rebase to drop the obsolete commit'' wasn''t waiting on #719 — it was already obsolete the moment v2 merged. Worth checking sooner next time before assuming a fix is still load-bearing.' - xfail-mostly-fixable: - claim: 'Most #708 xfails are about to be resolved: canfar_monitor IndentationError (4 xfails) and summary_run -h (1 xfail) are fixed in #714; astroquery/numba/fitsio import xfails (5 modules) resolve in #719 because uv sync installs them from pyproject. Only stile/treecorr corr2 (4 modules) is a separate issue requiring stile removal or upstream patch.' 
- dependabot-policy: - claim: 'shapepipe now ships `.github/dependabot.yml` (#728) with 14-day cooldown, monthly grouped lockfile PRs, github-actions ecosystem opted in, and SHA-pinned actions across all four workflows. Reasoning lives in the file itself + the #728 PR body. Companion fiber [[shapepipe/sqlitedict-pickle-smell]] tracks the single dismissed alert.' - 712-empirically-resolved: - claim: 'Issue #712 is empirically resolved against current `ghcr.io/cosmostat/shapepipe:develop` (dev target, post-#728). Both the original packages (astroquery, numba, fitsio) and Martin''s May 18 follow-up list (scipy, joblib, importlib_metadata, tqdm, LSSTDESC.Coord, pyyaml, astropy_iers_data, pyerfa) import cleanly in both read-only and writable sandbox modes, as do the three originally-flagged runner modules. Pyproject confirms astroquery/numba/joblib/tqdm are core deps; the rest are transitives of astropy/mccd/modopt/galsim; fitsio is gated in both runtime (`--extra jupyter --extra fitsio`) and dev (`--extra dev`) targets. Comment posted; awaiting Martin reply before closing. Likely root cause of the May 18 report: cached/older image.' -decisions: - setuptools-pin: - label: drop setuptools<81 pin - default: merged - options: - merged: - label: 'Already merged as #718 (c9e71df8) — small one-liner, agreed in transcript' ---- - -Snapshot of CosmoStat/shapepipe PR state, maintained as a living index. - -## Open — infra - -(All infra PRs landed. The dependabot stream is resolved; supply-chain -posture set; SHA-pins in place. See [[shapepipe/sqlitedict-pickle-smell]] -for the one open security-fiber.) - -## Open — issues (mine) - -| # | What | Status | -|---|---|---| -| #712 | Dockerfile missing runtime deps | Empirically resolved against current `:develop` ([comment](https://github.com/CosmoStat/shapepipe/issues/712#issuecomment-4562085977)). 
Both original list (astroquery/numba/fitsio) and Martin's May 18 follow-up (scipy/joblib/importlib_metadata/tqdm/LSSTDESC.Coord/pyyaml/astropy_iers_data/pyerfa) import cleanly in read-only + writable sandbox modes. Awaiting Martin reply before closing. | -| #711 | summary_run -h crashes | Fixed by #714 (auto-closes on merge) | -| #709 | canfar_monitor IndentationError | Fixed by #714 (auto-closes on merge) | - -## Open — mine (science / fixes) - -| # | Branch | What | Status | -|---|---|---|---| -| #731 | `chore/smoke-test-read-only` | smoke-test in read-only mode | Open. Adds `shapepipe_run_example` wrapper; CI now runs the entry-point smoke under `docker --read-only --tmpfs /tmp:rw`. See [[shapepipe/smoke-test-read-only]]. | -| #714 | `fix/develop-bugs` | small develop bugs (#709, #711) | Open. Originally a multi-bug fix; the Dockerfile portion got absorbed into #719. Worth checking what's still load-bearing here vs already-fixed-upstream. | -| #708 | `chore/testing-scaffold` | Tier 0–2 test scaffolding | Open. Some xfails should have flipped to xpass after the v2 + uv-lockfile work; needs a rebase + xfail-list audit. | - -## Open — others' PRs awaiting attention - -| # | Author | What | -|---|---|---| -| #725 | aguinot | Fix centroid shift | -| #704 | martinkilbinger | Contributors | -| #703 | martinkilbinger | V1.3.x | -| #699 | martinkilbinger | Coverage mask | -| #670 | lbaumo | file_io handles sextractor header | -| #660 | martinkilbinger | Existing output directory | -| #650 | martinkilbinger | Third-party catalogue for tile objects | -| #636 | martinkilbinger | Rho statistics: flexible training/test split | - -## Recently closed - -- **#728** `chore/dependabot-config` — dependabot.yml + SHA-pin all actions. Merged 2026-05-28. -- **#727, #726, #724, #722, #721, #720** — dependabot security bumps for idna/urllib3/gitpython/mistune/jupyter-server/jupyterlab. All squash-merged 2026-05-28 (see [[shapepipe/dependabot-pr-triage]]). 
-- **#719** `chore/uv-lockfile` — merged 2026-05-05 (Martin). -- **#718** `chore/drop-setuptools-pin` — merged. -- **v2.0 PR** — merged. Source of the skaha/conda situation that #719 unwound. - -## Connections - -- [[shapepipe]] — root -- [[shapepipe/docker-uv-revert]] — drove #719 -- [[shapepipe/dependabot-pr-triage]] — drove the 6 security-bump merges (closed) -- [[shapepipe/sqlitedict-pickle-smell]] — future-work fiber for the one dismissed alert diff --git a/.felt/shapepipe.md b/.felt/shapepipe.md index 40d321969..044d7a3b1 100644 --- a/.felt/shapepipe.md +++ b/.felt/shapepipe.md @@ -1,50 +1,40 @@ --- -name: ShapePipe maintenance & PRs +name: ShapePipe — project knowledge & active threads tags: - shapepipe - - portolan created-at: 2026-04-27T11:26:38.71538657+02:00 -outcome: 'Root: collaboration with Martin on ShapePipe — PRs, infra, future ngmix and Fabian work' +outcome: 'Root of ShapePipe''s felt store: the stack division, repo conventions, and the why behind in-flight infra/cleanup threads.' --- -ShapePipe is the UNIONS shape-measurement pipeline. I'm not the primary -maintainer (that's Martin Kilbinger); my role is collaborator helping -clean up infra, surface bugs, and keep the merge queue moving while -Martin focuses on science threads. +This is the root of ShapePipe's felt store — shared notes on architecture +decisions, conventions, and in-flight work, for the team and AI agents alike. +ShapePipe is the UNIONS galaxy shape-measurement pipeline; `CLAUDE.md` covers the +build / container / CI overview, and the fibers here carry the *why*. Start here, +then follow the links. -## Working agreement with Martin +## Stack division -Surfaced over a 2026-04-27 walking conversation. Captured in -[[shapepipe/prs-in-flight]] and the per-thread fibers below. +ShapePipe **produces** shear catalogues; `sp_validation` / `cosmo_val` +**consume** and validate them; `cs_util` holds code shared across both. 
A concern +about *validating* catalogues belongs downstream, not in ShapePipe. -- I review and patch his PRs; he reviews mine. Bugs found during review - go to a dedicated PR rather than getting bundled into his feature - branch (per `feedback_separate_infra_prs`). -- v2.0 was merged fast (it was ready). The skaha base it brought in is - the active source of pain → see [[shapepipe/docker-uv-revert]]. -- I file the issues; Claude usually drafts the PRs in my voice. - Disclosure on Claude-only review per - `feedback_claude_only_review_disclosure`. - -## Active threads - -- **[[shapepipe/docker-uv-revert]]** — slim Python + uv lockfile, drop conda. PR #719 (draft). -- **[[shapepipe/prs-in-flight]]** — tracking #708 (testing scaffold), #714 (develop bugs), #719 (this one). - -## Future work +## Conventions specific to this repo -- **[[shapepipe/ngmix-update]]** — replace Axel's stable_version fork - with upstream ngmix; reconcile with Lucy's wrapper. -- **[[shapepipe/fabian-coord-bug]]** — port Fabian's 1-line coord - propagation fix; first need his image-sim code on github. +- **Rho-statistics are obsolete inside ShapePipe.** PSF-systematics validation + moved downstream to `sp_validation` / `cosmo_val` (via `shear_psf_leakage`); + the stile/treecorr rho code was removed in #715. But the **meanshapes / + ellipticity focal-plane plots** (`mccd_plots_runner`) are *deliberately kept* — + they are a general PSF/star-catalogue diagnostic, not rho-stats, and feed + catalogue-paper figures. Don't delete that path along with rho-stats; see + [[shapepipe/cleanup-rhostats-jobscripts]] for where the boundary actually sits. +- Run the pipeline through the container; use `python3.12` explicitly inside it. +- **ngmix** is pinned to a fork branch until fixes land upstream — don't bump + that dependency line. [[ngmix-update]] tracks the path back to upstream. 
-## Conventions specific to this repo +## Active threads -- Container runs through `app` (apptainer wrapper); use `python3.12` - inside the shapepipe container (see `reference_containers`). -- ShapePipe produces; `sp_validation` consumes; `cs_util` is shared (see - `project_stack_division`). -- Rho stats are obsolete here — sp_validation/cosmo_val took over (see - `project_rho_stats_obsolete`). -- Royal "we" in PR/issue voice; specific findings attributed to Claude - by name (see `feedback_writing_voice_on_cails_behalf`). +- **[[shapepipe/ci-green-on-develop]]** / **[[shapepipe/test-suite]]** — a + tiered, in-image test suite and trustworthy CI on `develop`. +- **[[docker-uv-revert]]** — slim Python base + uv lockfile, dropping conda. +- **[[shapepipe/mpi-hybrid]]** — running hybrid MPI through the container on candide. +- **[[ngmix-update]]** — replacing the pinned ngmix fork with upstream. diff --git a/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md b/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md index 29ab2f689..629d6c23d 100644 --- a/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md +++ b/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md @@ -64,7 +64,7 @@ just CI. Deserves its own issue; #732 doesn't touch it. ## Knock-on -[[shapepipe/prs-in-flight]]: **#729** (actions group, bumps `setup-miniconda` +**#729** (actions group, bumps `setup-miniconda` v3→v4) hit the layer-1 failure too — confirming the action bump alone doesn't fix the path. #729 must rebase on top of #732 once it merges before it can go green. 
The smoke-test work in [[shapepipe/smoke-test-read-only]] diff --git a/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md b/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md index 63e39c445..4083ae1b8 100644 --- a/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md +++ b/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md @@ -1,25 +1,32 @@ --- name: 'ShapePipe cleanup: remove obsolete rho-stats/stile; modernize candide job scripts' -status: open +status: closed tags: - shapepipe - cleanup - constitution created-at: 2026-05-30T21:45:50.977369486+02:00 +closed-at: 2026-05-31T12:53:30.382233194+02:00 outcome: |- - Done — two PRs open against develop, neither merged (Martin reviews). (1) PR - #736: removed the in-ShapePipe PSF-systematics plotting path (mccd_plots_runner - + mccd_plot_utilities — the "rho statistics" were only a docstring promise, the - code computed mean shapes + histograms, no treecorr/stile). Two sole-purpose - configs deleted whole; six configs edited; docs updated to point PSF diagnostics - at sp_validation/cosmo_val. In-image pytest 250 passed; CI green. (2) PR #737: - candide_smp.sh / candide_mpi.sh now run via apptainer + the runtime image, no - conda; SMP verified end-to-end on c03 (0 errors), MPI hybrid pattern written but - needs a real allocation to verify (hangs on login node). Two scope findings: - `stile` was already vestigial (zero refs anywhere — nothing to remove); - `random_cat` was KEPT — it is a general LSS random-catalogue generator, not part - of the rho-stats path, so deleting it would overreach what Martin flagged. - canfar + ccin2p3 (cc_*.sh) scripts left untouched and noted in PR #737. + Resolved as one shipped PR + one corrected mis-scope. + + D1 (rho-stats removal) was a STALE PREMISE: the rho-stats/stile/treecorr code was + already surgically removed from develop in #715 (merged 2026-04-23). 
What remained + in `mccd_plots_runner.py` / `mccd_plot_utilities.py` is pure meanshapes/ellipticity + plotting — NOT rho-stats — and Martin explicitly asked to keep it on #715 ("Let's + keep meanshapes, this is very useful... can be run on merged star and PSF catalogues"). + PR #736 was opened then CLOSED (not merged): deleting meanshapes would contradict + Martin and risk a catalogue-paper figure path. `stile` was already gone everywhere. + Lesson: verify the premise against current develop before cutting the branch. + + D2 (candide PBS scripts) SHIPPED as PR #737 — OPEN, CI green, mergeable, awaiting + Martin's review. candide_smp.sh / candide_mpi.sh now run via `apptainer exec` against + ghcr.io/cosmostat/shapepipe:develop-runtime (no conda); host-clone bind-mounted at the + same path so $SPDIR-relative configs resolve identically in/out of container; MPI uses + the hybrid host-mpiexec pattern. Tested on c03=candide: SMP runs the example pipeline + end-to-end with 0 errors; MPI hybrid needs a real multi-node allocation to verify e2e. + canfar + ccin2p3 scripts deliberately untouched (different clusters, can't verify here) + and noted in the PR. Also fixed a stale config path and propagated the real exit code. shuttle: enabled: true kind: oneshot @@ -27,9 +34,9 @@ shuttle: project_dir: /automnt/n17data/cdaley/unions/shapepipe agent: claude-opus session: - id: 30ae76cc-6d3d-4773-827f-b6505ca7f3e9 + id: f1758ecc-bf5f-452c-9f92-6393adebe65e agent: claude-opus - dispatched_at: 2026-05-30T19:52:16.666358713Z + dispatched_at: 2026-05-31T10:51:28.745315935Z --- ## Desired State @@ -130,30 +137,7 @@ green on each, neither merged, canfar untouched-and-noted. don't touch); the broader test-suite work (separate, done); any scientific-algorithm change. -## Resolution - -**`random_cat` is NOT rho-stats-only — kept.** Empirically: `random_cat_package` -generates a random catalogue of points within the survey mask (healpix output, -`config_Rc.ini`, authored by Martin). 
That is a general clustering / LSS tool -(Landy-Szalay-style randoms), independent of PSF systematics. Rho statistics are -auto/cross-correlations of PSF-ellipticity residuals at *star* positions and need -no random catalogue. So the conditional in Deliverable 1 ("remove after confirming -random_cat is rho-only") failed its precondition → random_cat stays. Removing it -would overreach what Martin called obsolete; flagged in PR #736 for a follow-up if -he does want it gone. - -**`stile` was already vestigial.** Zero references in `src/`, `example/`, `docs/`, -`pyproject.toml`, `uv.lock` — never a declared dep, never imported. Nothing to -remove; the absence is noted in PR #736 so it isn't read as an oversight. - -**The "rho-stats path" was really PSF-diagnostic plotting.** `mccd_plot_utilities` -imports only matplotlib/mccd/numpy/astropy — it computes mean shapes and ellipticity -histograms, no `treecorr`/`stile` correlation functions. The docstring's "rho -statistics plot" was aspirational. Maps cleanly to Martin's "rho stats within -shapepipe" = the PSF-leakage diagnostics now owned by `shear_psf_leakage`. - -**MPI verification gap.** `candide_smp.sh` is verified end-to-end through the -container on c03. `candide_mpi.sh` uses the documented hybrid Apptainer pattern -(host `mpiexec` launches container ranks) but can't be verified on a login node — -`mpiexec` inside the container hangs without a PMIx allocation. ABI note for the -reviewer: image ships OpenMPI 4.1.4, candide modules are now OpenMPI 5.0.x. +## Open Questions + +- Is `random_cat` truly rho-stats-only, or does any non-rho config/use depend on + it? Confirm before deleting it (vs. just `mccd_plots_runner`). 
diff --git a/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md b/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md new file mode 100644 index 000000000..2bfc76d5c --- /dev/null +++ b/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md @@ -0,0 +1,84 @@ +--- +name: 'ShapePipe execution modes (smp/mpi) and schedulers (PBS/SLURM): what the repo''s tooling shows' +tags: + - shapepipe + - mpi + - reference +created-at: 2026-05-31T16:51:46.221097637+02:00 +outcome: 'By the repo''s lights SMP is the exercised path (55/56 example configs; every canfar/candide job script is SMP-only via N_SMP, SLURM+conda); MPI is the 2019 mode, set in 1 config, and its code/config drifted out of sync (module_config_sec bug dates to #415 by git history). PBS is dead (2019 example scripts only); SLURM is current everywhere. CAVEAT: this is what the repo shows, not how ShapePipe was actually run — canfar carried most processing and is invisible from here, so MPI usage history is unknown.' +--- + +Two orthogonal axes that are easy to conflate when reasoning about how ShapePipe +runs on a cluster. This fiber pins down what each is, when it entered, and what's +actually used today vs. legacy — the context for [[shapepipe/mpi-hybrid]]. + +## Axis 1 — execution mode (`[EXECUTION] MODE`, inside ShapePipe) + +Dispatched in `src/shapepipe/run.py`: `mode = config["EXECUTION"]["MODE"].lower()`, +then `run_mpi(pipe, comm)` if `mode == "mpi"` else `run_smp(pipe)`. If mpi4py isn't +importable, mode is forced to `smp`. + +- **`smp`** — joblib `Parallel(n_jobs=batch_size)` across cores on **one node** + (`job_handler._distribute_smp_jobs`). **The living path.** 55 of 56 example + configs set `MODE = SMP`; every canfar/candide production script drives it by + injecting `N_SMP` into the config (`SMP_BATCH_SIZE`). +- **`mpi`** — mpi4py scatter/gather across **multiple nodes** (`pipeline/mpi_run.py`, + `submit_mpi_jobs`). 2019-era (`c6554983` "initial mpi framework"). 
Exactly **1** + example config uses it. The `worker()` call in `mpi_run.py` has been out of sync + since PR #415 (Jan 2025) — `worker()` gained a `module_config_sec` param and + `mpi_run.py` wasn't updated, so it passes 7 args where 8 are required. On candide + it couldn't even wire up (PMIx mismatch, see [[shapepipe/mpi-hybrid]]), so the + code bug couldn't surface here. Whether MPI was run elsewhere (canfar especially, + which we can't see) is unknown — what's clear is the repo's tooling is all SMP. + +**SMP and MPI are the same computation behind two dispatchers.** Both call the +identical `WorkerHandler.worker()` with the identical 8 args (`job_handler._distribute_smp_jobs` +vs `mpi_run.submit_mpi_jobs`). The MPI path's only inter-rank traffic is `bcast` +of setup objects, one `scatter` of the independent job-list, and one `gather` of +result dicts — `worker_handler.py` (the actual work) has zero MPI in it. No +`Send`/`Recv`/`Allreduce`/`Barrier` during compute. That's the signature of an +**embarrassingly parallel** workload: MPI provides no computational capability +that SMP-on-a-node-plus-a-scheduler lacks — it's a job-distribution convenience +(one `mpirun` spanning nodes vs. the submission layer fanning out per-node jobs). +This is what grounds the "is MPI worth keeping?" question to Martin — observed +from the comm pattern, not inferred from usage. + +Note `MODE` is overloaded across config sections — `CLASSIC`, `MULTI-EPOCH`, +`FIT_VALIDATION`, `VALIDATION` are *module* modes (PSF / ngmix), not `[EXECUTION]` +modes. Only `smp`/`mpi` live under `[EXECUTION]`. + +## Axis 2 — scheduler (the batch wrapper, outside ShapePipe) + +- **PBS** (`#PBS` / `qsub`) — the 2019 `example/pbs/` scripts. **Dead** on candide + (migrated to SLURM). All `#PBS` directives removed on the #737 branch. +- **SLURM** (`#SBATCH` / `sbatch`) — **current everywhere**. canfar since ~2020, + candide since 2024. 
+ +## What the dates and tooling show + +The maintained submission tooling is SMP-only and SLURM-based: `scripts/sh/run_scratch_local.sh` +(2024-11, *"submit jobs on candide"*) → `init_run_exclusive_canfar.sh` → `job_sp_canfar.bash`, +all `sbatch`, all **SMP** via `N_SMP` ("SMP mode only" in their help), and still **conda** +(`CONDA_PREFIX=$HOME/.conda/envs/shapepipe`), *not* the container. The `example/pbs/candide_{smp,mpi}.sh` +scripts are 2019 **teaching examples** (untouched until the #737 branch). + +This is evidence about the tooling, not a claim about run history. It's suggestive — the +SMP tooling is what's been maintained, the MPI mode and its example config drifted untouched — +but most processing ran on canfar, which isn't visible from this repo, so how much MPI was +actually used is a question for the people who ran it, not something the repo can answer. + +## Implications + +- The MPI fix is worth landing — `mpi` is a supported mode and getting it working through + the container on candide was the point — framed as enablement/verification, not as + unblocking some known-active workload. +- Production scripts (SMP + SLURM + conda) are untouched by #737 and out of scope; they're + also **not yet containerized** — a future gap to name. +- **Decision deferred to Martin (asked in #737):** is MPI worth getting working / + maintaining on candide at all, or should candide just use SMP (which works through + the container — `candide_smp.sh`)? Given SMP and MPI are the same computation, MPI + earns its keep only as an ergonomic convenience. We do *not* retire it unilaterally — + it's a documented public mode; #737 leaves it in working order and Martin makes the + call. If kept, add a CI smoke so it can't silently rot again; if dropped, removal is + clean and contained (`mpi_run.py`, `run_mpi`, the `import_mpi` branches, `mpi4py`, + `candide_mpi.sh`). 
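Reviewer sketch (not part of the patch): the exec-modes-schedulers fiber above argues that SMP and MPI are the same computation behind two dispatchers. A minimal way to see that claim, assuming nothing about ShapePipe's real code: `worker`, `dispatch_smp`, `dispatch_mpi_simulated` and `dispatch` below are hypothetical stand-ins, the MPI side is simulated in plain Python (no mpi4py), and a thread pool stands in for the joblib fan-out. Only the `config["EXECUTION"]["MODE"].lower()` lookup and the fall-back-to-`smp` rule are taken from the text.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(job, scale):
    """Hypothetical stand-in for the per-job work: no inter-job communication."""
    return {"job": job, "result": job * scale}


def dispatch_smp(jobs, scale, batch_size=2):
    """SMP-style dispatcher: fan the job list out over local workers."""
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        # map() preserves input order, so results come back sorted by job
        return list(pool.map(lambda job: worker(job, scale), jobs))


def dispatch_mpi_simulated(jobs, scale, n_ranks=3):
    """MPI-style dispatcher, simulated without mpi4py.

    Mirrors the comm pattern described in the fiber: one scatter of the
    job list, independent compute per rank, one gather of result dicts.
    """
    # "scatter": round-robin split of the job list, one chunk per rank
    chunks = [jobs[rank::n_ranks] for rank in range(n_ranks)]
    # compute: every rank calls the same worker with the same arguments
    per_rank = [[worker(job, scale) for job in chunk] for chunk in chunks]
    # "gather": flatten the per-rank results back on the root
    gathered = [res for rank_results in per_rank for res in rank_results]
    return sorted(gathered, key=lambda res: res["job"])


def dispatch(config, jobs, scale, mpi_available=True):
    """Pick a dispatcher from [EXECUTION] MODE; force smp if MPI is unavailable."""
    mode = config["EXECUTION"]["MODE"].lower()
    if not mpi_available:
        mode = "smp"
    if mode == "mpi":
        return dispatch_mpi_simulated(jobs, scale)
    return dispatch_smp(jobs, scale)
```

Because the worker never talks to its neighbours, both dispatchers return the same result dicts for the same job list; the only difference is who does the fanning out. A worker that needed `Send`/`Recv` or a reduction mid-compute would break this equivalence, which is the distinction the fiber draws.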
diff --git a/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md new file mode 100644 index 000000000..8e436c30d --- /dev/null +++ b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md @@ -0,0 +1,256 @@ +--- +name: ShapePipe hybrid MPI through the container on candide +status: active +tags: + - shapepipe + - mpi + - container + - candide +created-at: 2026-05-31T12:22:50.017370879+02:00 +outcome: |- + THREE layers of MPI bit-rot, all fixed, verified e2e on candide via the unmodified + candide_mpi.sh against the published image (job 780660: 4 ranks/2 nodes, all 3 modules, + 0 errors, real exit 0). (1) LAUNCHER: container shipped OpenMPI 4.1.4/PMIx2 vs candide + host 5.0.x/PMIx5 → hybrid MPI gave N rank-0 singletons. Fixed by building OpenMPI 5.0.8 + from source in the image (--disable-dlopen, bundled PMIx5/PRRTE), dropping libopenmpi-dev, + keeping the mpi4py wheel (uv.lock untouched); SLURM-ified candide scripts; CI publishes on + every branch push. (2) SHAPEPIPE CODE: with ranks wired up, shapepipe_run hit "worker() + missing module_runner" — latent since #415 (mpi_run.py never updated when worker() gained + module_config_sec). Fixed in e5999733. (3) STALE CONFIG: config_mpi.ini used pre-2020 module + names without the _runner suffix → "No module named python_example" + a 5-min deadlock. + Fixed in 7e7b7448. All three drifted undetected because the repo's exercised path is SMP, + not MPI ([[shapepipe/exec-modes-schedulers]]); actual MPI run history (esp. canfar) is + unknown from here. HARDENING PASS: KEPT a swallowed-exit-code fix (33494d74: main() now returns + run()'s value — every caught error had been exiting 0, broad + unrelated to MPI). PROTOTYPED + then PULLED a singleton preflight guard (check_mpi_world: abort when OMPI_COMM_WORLD_SIZE != + COMM_WORLD size — SLURM_NTASKS unreliable) — verified working but removed as scope creep on a + maybe-retired mode; recipe parked in Layer 4. STILL OPEN: rank-0 mid-setup deadlock. 
REMAINING: + Martin review + merge of #737; sharpened question — is MPI a used dependency at all? (hard + mpi4py dep, 2 example scripts, 1 config, 0 production paths). +--- + +## The problem + +The "MPI verification gap" flagged in [[shapepipe/cleanup-rhostats-jobscripts]]: +PR #737's `candide_mpi.sh` uses the correct Apptainer **hybrid** pattern (host +`mpirun` launches one container rank per task) but couldn't be verified, and the +container/host OpenMPI versions had drifted apart. + +Goal: actually run ShapePipe through the container under MPI on candide, end to +end, following [Apptainer's MPI guidance](https://apptainer.org/docs/user/main/mpi.html). + +## What the data said + +Empirical test on candide (image = `ghcr.io/cosmostat/shapepipe:develop-runtime`, +host `module load openmpi/5.0.8`, single node, 4 ranks): + +``` +mpirun -n 4 apptainer exec $SIF python -m mpi4py.bench helloworld + → Hello, World! I am process 0 of 1 on n23. (×4) +``` + +Four singletons instead of one 4-rank job. Apptainer's docs name this exactly: +*"If your containers run N rank 0 processes … the MPI stack used to launch is not +compatible with the MPI stack in the container."* + +**Root cause — PMIx wire mismatch.** The hybrid model needs the container's MPI +to speak the same PMIx as the host launcher. + +| | OpenMPI | PMIx | +|---|---|---| +| container (Debian bookworm `libopenmpi-dev`) | 4.1.4 | 2.x (`MCA pmix: ext3x`, `--with-pmix=.../pmix2`) | +| candide host (`openmpi/5.0.8`) | 5.0.8 | 5.x (internal) | + +PMIx 2 client cannot connect to the PMIx 5 server PRRTE stands up, so each rank +initializes standalone. (`libmpi.so.40` is ABI-stable across OpenMPI 4↔5, which +is why mpi4py *imports* fine — but import isn't wire-up.) + +## The fix + +Build **OpenMPI 5.0.x from source** in the image (bundled PMIx 5 / PRRTE, +`--with-pmix=internal --with-prrte=internal --with-hwloc=internal +--with-libevent=internal --disable-dlopen`). 
The stock mpi4py wheel (from +uv.lock) dlopens `libmpi.so.40`, the soname this build provides, so it needs +**no rebuild** and `uv.lock` stays a pure SSOT. `--disable-dlopen` links MCA +components statically — it both fixes an internal-openpmix `pdl` configure +failure (wants libltdl headers otherwise) and is the right posture for a +container (no dlopen of plugin .so across the SIF/bind boundary). + +Proven locally on candide before committing: a minimal proof container compiled +OpenMPI 5.0.8 + built mpi4py clean, and the `--disable-dlopen` flag was found by +iterating the configure step. Then switched to the **build-remotely / pull- +locally** loop (now in CLAUDE.md): edit Dockerfile → push → CI builds and +publishes to GHCR → `apptainer pull` on the cluster → test. Local `apptainer +build` is the wrong default — cluster quotas are tight (hit `disk quota +exceeded` on `$HOME`; keep SIFs + `APPTAINER_TMPDIR`/`CACHEDIR` on a data +partition). CI now publishes on every branch push (not just integration +branches) so any PR has a pullable, cluster-testable image before merge. + +## Keeping host ↔ container MPI in sync (design) + +The container seals off the host's userspace *except* MPI — to use the +interconnect + launcher you need the in-image MPI to cooperate with host +machinery you can't seal off. The contract is narrower than "same version": +what must match is the **PMIx wire protocol** and **launch mechanism**, and +PMIx is compatible *within a major version*. So the compatibility unit is the +**5.0.x series**, not the point release — hence `module load openmpi` (default) +in the job script and `OMPI_VERSION` as a Docker `ARG` (retarget = one number). + +Spectrum for multi-cluster / differing-MPI futures, cheapest → most robust: +1. **Pin a series + track targets** (chosen). One image covers every PMIx-5 + cluster. Most modern HPC is here now. +2. 
**CI matrix → variants** from the same build-arg (`:…-ompi5`, `:…-ompi4`) + when two targets straddle a PMIx major. One source, N artifacts. +3. **Bind model** (`--bind $MPI_DIR`): no MPI baked, host MPI mounted in — + always matches but fragile (glibc/path/admin-bind caveats). Fallback. +4. **Wi4MPI** (a CEA tool): MPI translation layer, write-once-run-anywhere + across MPI families. Heaviest; the escalation if 1–2 don't suffice. +5. **Preflight self-check** (complements any): run a 2-rank helloworld, detect + the "rank 0 of 1" singleton signature, fail loudly instead of silently + running N independent copies → wrong science. Recommended regardless; turns + silent desync into an obvious error. Prototyped and verified, then pulled + from this PR as scope creep; recipe parked in Layer 4. + +## Environment facts (candide, 2026-05) + +- **Scheduler is SLURM**, not PBS — `qsub`/`qstat` are gone; partitions `comp` + (2-day) / `compl` (5-day), idle nodes available. The `#PBS` directives in the + candide job scripts are dead. +- **Host OpenMPI**: modules `openmpi/5.0.3`–`5.0.10`, built `-slurm-CentOS8` + (`/softs/openmpi/5.0.8-slurm-CentOS8`). The 4.0.5 the old script loaded is gone. +- **srun launch is not viable** for OpenMPI 5 here: `srun --mpi=list` → + none/cray_shasta/pmi2 only (no pmix). Use `mpirun` (PRRTE carries PMIx). +- **Local container builds work** via `apptainer build --fakeroot` even without + `/etc/subuid` entries (root-mapped namespace; `allow setuid = yes`). + +## Deliverables (on #737 branch `cleanup/candide-scripts-container`) + +All committed (`4fc948db` MPI fix, `d31d4d26` CI), pushed, CI building. Going +onto the existing #737 PR rather than a new one — this completes the candide- +scripts work #737 started. + +1. **Dockerfile** → OpenMPI 5.0.8 from source, `--disable-dlopen`; libopenmpi-dev + dropped; mpi4py wheel kept (uv.lock untouched). +2.
**candide job scripts** → SLURM (`#SBATCH`), `module load openmpi` (default), + `mpirun -n $SLURM_NTASKS apptainer exec … shapepipe_run`. + (`example/pbs/config_mpi.ini` already existed; its stale module names are fixed in item 6.) +3. **docs / CLAUDE.md** — hybrid-MPI run pattern; build-remotely/pull-locally loop. +4. **CI** — publish on every branch push so PR images are cluster-testable. +5. **ShapePipe MPI code fix** (`e5999733`) — thread `module_config_sec` through + `run_mpi`/`submit_mpi_jobs`/`worker()`; the latent #415 bug surfaced once the + launcher worked. Shipped in the published image (CI rebuild). +6. **Stale example config fix** (`7e7b7448`) — `config_mpi.ini` module names + `*_runner`-suffixed to match the loader; surfaced running the real script. +7. **Exit-code propagation fix** (`33494d74`) — `main()` returns `run()`'s value; + every caught error had been exiting 0. + regression test. + +Pulled from this PR (parked follow-up, gated on MPI being kept): the +`check_mpi_world()` singleton preflight guard — prototyped + verified, recipe in +Layer 4 below. + +## Empirical close (2026-05-31) — three layers + +The fix turned out to have **three independent layers**. The launcher fix +(above) was necessary but not sufficient: making the ranks actually wire up +exposed a latent bug in ShapePipe's own MPI code, then a stale example config. + +**Layer 1 — launcher (PMIx), verified.** Pulled the PR image on candide and +ran the rank wire-up check (2 nodes, 4 tasks, `module load openmpi` → `mpirun +-n 4 apptainer exec … python -m mpi4py.bench helloworld`): + +``` +Hello, World! I am process 0 of 4 on n23. +Hello, World! I am process 1 of 4 on n23. +Hello, World! I am process 2 of 4 on n25. +Hello, World! I am process 3 of 4 on n25. +``` + +One 4-rank job spanning two nodes — the exact inverse of the pre-fix 4× +"rank 0 of 1". Image reports `Open MPI: 5.0.8`.
✓ + +**Layer 2 — ShapePipe MPI code, was broken, now fixed.** With the ranks wired +up, the actual `shapepipe_run` under MPI immediately hit: + +``` +ERROR: WorkerHandler.worker() missing 1 required positional argument: 'module_runner' +``` + +By git history this dates to PR #415: `worker()` gained a `module_config_sec` +parameter and `pipeline/mpi_run.py:submit_mpi_jobs` wasn't updated in step, so +it passes 7 args where 8 are required. On candide this path wasn't reachable +until the launcher fix (PMIx never let MPI start here), so it couldn't surface +on this cluster before. How much MPI has actually been exercised elsewhere — +canfar especially, which we can't see from here — is unknown; what we can say +is the repo's tooling points entirely at SMP (see +[[shapepipe/exec-modes-schedulers]]). Fixed by threading `module_config_sec` +through `run_mpi` → `submit_mpi_jobs` → `worker()` (commit `e5999733`), +matching the SMP/serial call sites. + +Verified with a host-src override (job 780655): fixed `submit_mpi_jobs` +signature live in-container, 4 ranks across n23+n25, all three modules +produced output, real `RUN_EXIT=0`, 0 errors. + +**Layer 3 — stale example config, now fixed.** With the code fix baked into +the published image, the *actual* unmodified `candide_mpi.sh` against +`config_mpi.ini` first hit `No module named 'shapepipe.modules.python_example'` +then deadlocked to the 5-min wall clock. `config_mpi.ini` (last touched 2020) +still used the pre-suffix module names (`python_example`, `[PYTHON_EXAMPLE]`); +the loader needs the full runner names (`python_example_runner`, +`[PYTHON_EXAMPLE_RUNNER]`), as `example/config.ini` uses. Updated to match +(commit `7e7b7448`). Same flavour as Layers 1–2: the MPI path's tooling and +example config drifted out of sync with the rest of the repo, undetected, +because the repo's exercised path is SMP, not MPI. 
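
The Layer 3 drift (module names missing the `_runner` suffix) is the kind of thing a cheap static check could flag before a job is submitted. A minimal sketch, assuming the config keeps the `[EXECUTION]` / `MODULE` layout of `config_mpi.ini` and can be read with Python's standard `configparser`. The helper `find_unsuffixed_modules` is hypothetical and not part of ShapePipe; it inspects only the suffix and never imports a module.

```python
import configparser

# Suffix the loader expects on every module name (per Layer 3).
RUNNER_SUFFIX = "_runner"


def find_unsuffixed_modules(config_text):
    """Return the [EXECUTION] MODULE entries that lack the _runner suffix.

    Hypothetical helper for illustration only; ShapePipe's own config
    handling may differ.
    """
    # interpolation=None keeps values such as "$SPDIR/..." untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    raw = parser.get("EXECUTION", "MODULE", fallback="")
    names = [name.strip() for name in raw.split(",") if name.strip()]
    return [name for name in names if not name.endswith(RUNNER_SUFFIX)]


# The pre-fix and post-fix [EXECUTION] sections of config_mpi.ini.
stale = """
[EXECUTION]
MODULE = python_example, serial_example, execute_example
MODE = mpi
"""

fixed = """
[EXECUTION]
MODULE = python_example_runner, serial_example_runner, execute_example_runner
MODE = mpi
"""

print(find_unsuffixed_modules(stale))
print(find_unsuffixed_modules(fixed))
```

Run against the two variants, the stale config yields all three bare names and the fixed one yields an empty list. A check like this would catch only the naming drift, not the rank-0 deadlock that followed it.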
+ +## Layer 4 — silent-failure hardening (the "warning sign") + +A deeper pass on the singleton failure (option 5 in the spectrum above) turned +up two more silent-failure paths. One was kept; one was prototyped, verified, +then deliberately pulled back out (see below). + +**(a) Swallowed exit code — KEPT (`33494d74`).** `main()` in `shapepipe_run.py` +called `run(args)` without returning it, so `exit(main())` was always +`exit(None)` → 0. **Every caught error in ShapePipe — not just MPI — has been +exiting 0**, invisible to `exit $?` and CI. Fixed to `return run(args)` + +regression test. Broad, simple, unrelated to MPI's fate, so it stays. + +**(b) Singleton preflight guard — PROTOTYPED, then PULLED (`2289e6a7` reverted).** +In the singleton case every process is master, `split_mpi_jobs(list, 1)` hands +each the *full* job list, and they all run the whole pipeline into the same +output dir — N uncoordinated copies, exit 0, plausible-but-wrong. The exit-code +fix does **not** catch this: singletons don't raise, they "succeed" wrongly. A +`check_mpi_world()` preflight was written and verified on a real allocation +(healthy passes; OMPI-4-under-OMPI-5-host fires + exits non-zero). It was then +removed from #737 as scope creep: the failure is already designed out on candide +by the OpenMPI-5 match, it adds a runtime check to core `run.py`, and MPI's +future is an open question (a hard `mpi4py` dependency used by only 2 example +scripts — candide + ccin2p3 `cc_mpi.sh` — 1 config, and 0 production paths). +**Recipe, if MPI is kept:** at the top of `run_mpi`, abort when +`int(os.environ["OMPI_COMM_WORLD_SIZE"]) != comm.Get_size()`. The hard-won part +is that signal choice — **`SLURM_NTASKS` is NOT usable** (reads `1` on +remote-node ranks even when healthy); `OMPI_COMM_WORLD_SIZE` is `4` in both +healthy and singleton, only `COMM_WORLD` differs. + +**Still open (distinct gap):** when rank 0 fails *mid-setup* for a non-singleton +reason (e.g. 
the stale-config module error in Layer 3), ranks 1..N block in the +following `bcast`/`scatter` until the wall clock — the guard runs *before* module +loading, so it doesn't cover this. Fixing it needs collective error propagation +(rank 0 signalling failure before the barrier). Left as a follow-up. + +**Genuinely verified end to end** (job 780660): the unmodified `candide_mpi.sh` +against the freshly-published `:cleanup-candide-scripts-container-runtime` image +(fix baked in, no override) ran the example pipeline — 4 ranks / 2 nodes, all +three `*_example_runner` modules produced output trees, *"A total of 0 errors +were recorded"*, real exit 0 (the script's `exit $?`). The deliverable script +itself works. + +> Correction: an earlier close claimed the full pipeline ran clean before any +> code fix. It did not — that run hit the Layer-2 error and the sbatch script's +> `RUN_EXIT=0` was a hardcoded `echo`, not the real exit code. The launcher half +> was real; the pipeline half was not, until the fixes above. + +**Remaining:** Martin's review + merge of #737. + +(Note: the in-image `mpi4py` import looks absent under `bash -lc` because the +login shell resets PATH off the venv — a probe artifact, not real; the actual +`mpirun apptainer exec python -m mpi4py.bench` run resolves it via the image's +default PATH and wires up fine, as the helloworld output shows.) diff --git a/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md b/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md index cba960b3c..b9bbe8849 100644 --- a/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md +++ b/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md @@ -67,5 +67,4 @@ both the runtime and dev target blocks. Sits in the same family as [[shapepipe/docker-multistage]] (which introduced the runtime/dev split) and [[shapepipe/docker-uv-revert]] -(which moved uv writable targets to `/tmp` via env vars). 
[[shapepipe/prs-in-flight]] -gets a new "in-flight" entry once the PR is up. +(which moved uv writable targets to `/tmp` via env vars). diff --git a/.github/workflows/deploy-image.yml b/.github/workflows/deploy-image.yml index b3b0dc54e..8f6c4a47c 100644 --- a/.github/workflows/deploy-image.yml +++ b/.github/workflows/deploy-image.yml @@ -3,16 +3,21 @@ name: Docker image — build, test, publish # Single source of truth for ShapePipe's environment is the Dockerfile # (slim Python + apt system deps + uv-frozen wheels). This workflow builds # that image, runs the test suite *inside it* — so CI tests exactly what -# ships — and publishes to ghcr only on pushes to the integration branches. +# ships — and publishes to ghcr. # -# pull_request → build + test, no publish (also works for fork PRs) -# push → build + test + publish (:develop, :latest, …) +# pull_request → build + test, no publish (covers fork PRs, which have no +# registry token) +# push (any branch) → build + test + publish, tagged with the branch name +# (e.g. :develop, :my-feature, and the -runtime variants) +# +# Publishing on every branch push — not just the integration branches — means +# any open PR has a pullable image (`apptainer pull …:-runtime`) that +# can be tested on a real cluster *before* merge. Same-repo branch pushes always +# carry a registry-write token, so this is safe; fork PRs still only build+test. on: push: branches: - - develop - - main - - master + - '**' pull_request: branches: - develop @@ -121,7 +126,8 @@ jobs: docker run --rm -e HYPOTHESIS_PROFILE=ci "$IMAGE" pytest -rX # ---------------------------------------------------------------- - # Publish (push events only — never on pull_request, incl. forks) + # Publish (push events only — never on pull_request, incl. forks). + # Fires on any branch; the image is tagged with the branch name. 
# ---------------------------------------------------------------- - name: Log in to the Container registry if: github.event_name == 'push' diff --git a/.gitignore b/.gitignore index 756386e12..3097dc78a 100644 --- a/.gitignore +++ b/.gitignore @@ -140,3 +140,5 @@ code .felt/index.db .felt/index-sync.lock .felt/index-sync.request +.felt/index.db-shm +.felt/index.db-wal diff --git a/CLAUDE.md b/CLAUDE.md index 17b7c90b8..723439ace 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -36,6 +36,17 @@ way to get all of that is the container. sandbox with a host clone of the repo bind-mounted in and `pip install -e` pointed at it, so edits on the host are live inside the container. +**Testing container changes: build remotely, pull locally.** Don't +`apptainer build` images on a cluster — quotas are tight and the build is slow. +The loop for any change to `Dockerfile` / `pyproject.toml` / `uv.lock` is: edit +→ push → let GitHub Actions build and publish to GHCR → `apptainer pull +docker://ghcr.io/cosmostat/shapepipe:[-runtime]` on the cluster → test. +Watch the remote build with `gh run watch` (or `gh run list --branch `). +The only things that run locally are the pull and the test. On a quota-limited +cluster, keep SIFs and Apptainer's scratch off `$HOME`: point +`APPTAINER_TMPDIR` / `APPTAINER_CACHEDIR` at a roomy data partition and pull +SIFs there. + Full detail: `docs/source/installation.md` and `docs/source/container.md`. ## Layout diff --git a/Dockerfile b/Dockerfile index a98507a32..a5b4b55f4 100644 --- a/Dockerfile +++ b/Dockerfile @@ -39,6 +39,9 @@ ENV SHELL=/bin/bash \ # - compilers and dev libs needed to build the heavier wheels (galsim, # mpi4py, python-pysap, fitsio). # - libgl1, proj, fftw at runtime for skyproj/PyQt5/galsim. +# OpenMPI is deliberately NOT installed from Debian here — bookworm ships +# OpenMPI 4.1.4 / PMIx 2.x, which breaks hybrid MPI on modern clusters. It is +# built from source in the next stanza; see there for the full reasoning. 
RUN apt-get update -y --quiet && \ apt-get install -y --no-install-recommends \ build-essential \ @@ -50,12 +53,45 @@ RUN apt-get update -y --quiet && \ libfftw3-dev libfftw3-bin \ libgsl-dev \ libcfitsio-dev \ - libopenmpi-dev openmpi-bin \ libproj-dev proj-bin \ libgl1-mesa-glx \ psfex source-extractor weightwatcher && \ apt-get clean && rm -rf /var/lib/apt/lists/* +# OpenMPI from source — required for hybrid Apptainer MPI on HPC clusters. +# +# On a cluster ShapePipe runs as a standard Apptainer "hybrid" MPI job: the +# host's `mpirun` launches one container rank per slot, and the OpenMPI inside +# the image wires the ranks together through PMIx. That handshake requires the +# container's PMIx to be compatible with the host launcher's. Debian bookworm's +# package is OpenMPI 4.1.4 with PMIx 2.x; modern clusters (e.g. candide) now run +# OpenMPI 5.0.x with PMIx 5.x, and a PMIx 2 client cannot talk to a PMIx 5 +# server — so every rank silently degrades to a standalone "rank 0 of 1" and the +# job runs N independent copies instead of one N-rank job. Building OpenMPI +# 5.0.x here (with its bundled PMIx 5 / PRRTE) matches those hosts; the 5.0.x +# series is mutually PMIx-compatible, so this image works against any host +# openmpi/5.0.x module. The stock mpi4py wheel (from uv.lock) dlopens +# libmpi.so.40, the soname this build provides, so it needs no rebuild. +# +# --disable-dlopen links every MCA component statically into libmpi / libpmix: +# it sidesteps an internal-openpmix configure failure (the `pdl` component wants +# libltdl headers otherwise) and is the right posture for a container anyway — +# no fragile runtime dlopen of plugin .so files across the SIF / bind boundary. 
+ARG OMPI_VERSION=5.0.8 +ARG OMPI_SERIES=v5.0 +RUN cd /tmp && \ + wget -q "https://download.open-mpi.org/release/open-mpi/${OMPI_SERIES}/openmpi-${OMPI_VERSION}.tar.bz2" && \ + tar xjf "openmpi-${OMPI_VERSION}.tar.bz2" && \ + cd "openmpi-${OMPI_VERSION}" && \ + ./configure --prefix=/opt/ompi \ + --with-pmix=internal --with-prrte=internal \ + --with-hwloc=internal --with-libevent=internal \ + --disable-dlopen --disable-sphinx && \ + make -j"$(nproc)" && make install && \ + cd / && rm -rf /tmp/openmpi-* +ENV PATH="/opt/ompi/bin:${PATH}" \ + LD_LIBRARY_PATH="/opt/ompi/lib:${LD_LIBRARY_PATH}" + # uv — fast reproducible Python deps installer. pyproject.toml + uv.lock # are the SSOT; `uv sync --frozen` installs exactly what uv.lock specifies, # so upstream changes only land when we deliberately regenerate the lockfile. diff --git a/example/pbs/candide_mpi.sh b/example/pbs/candide_mpi.sh index 0abbbb7f4..4eeef4a51 100644 --- a/example/pbs/candide_mpi.sh +++ b/example/pbs/candide_mpi.sh @@ -1,40 +1,50 @@ #!/bin/bash - -########################## -# MPI Script for CANDIDE # -########################## - -# Receive email when job finishes or aborts -## #PBS -M @cea.fr -## #PBS -m ea - -# Set a name for the job -#PBS -N shapepipe_mpi - -# Join output and errors in one file -#PBS -j oe - -# Set maximum computing time (e.g. 5min) -#PBS -l walltime=00:05:00 - -# Request number of cores (e.g. 2 from 2 different machines) -#PBS -l nodes=2:ppn=2 - -# Full path to environment -export SPENV="$HOME/.conda/envs/shapepipe" - -# Full path to example config file and input data -export SPDIR="$HOME/shapepipe" - -# Load modules -module load intelpython/3 -module load openmpi/4.0.5 - -# Activate conda environment -source activate $SPENV - -# Run ShapePipe using full paths to executables -$SPENV/bin/mpiexec --map-by node $SPENV/bin/shapepipe_run -c $SPDIR/example/config_mpi.ini - -# Return exit code -exit 0 +# +# Hybrid Apptainer MPI job for candide (SLURM). 
+# +# ShapePipe runs as a standard Apptainer "hybrid" MPI job: the host `mpirun` +# launches one container rank per SLURM task, and the OpenMPI + mpi4py inside +# the image handle the communication. For the ranks to find one another, the +# container's OpenMPI must speak the same PMIx as the host launcher -- the +# published image ships OpenMPI 5.0.x to match candide's OpenMPI 5.0.x modules. +# (An OpenMPI 4 image silently degrades to N independent "rank 0 of 1" +# processes.) +# +# Submit with: sbatch candide_mpi.sh + +#SBATCH --job-name=shapepipe_mpi +#SBATCH --partition=comp +#SBATCH --nodes=2 +#SBATCH --ntasks=4 +#SBATCH --ntasks-per-node=2 +#SBATCH --time=00:05:00 +#SBATCH --output=%x-%j.log +## #SBATCH --mail-type=END,FAIL +## #SBATCH --mail-user=@cea.fr + +# Path to the local ShapePipe clone (holds the example configs and data). +export SPDIR="${SPDIR:-$HOME/shapepipe}" + +# Path to the ShapePipe runtime image. Pull it once with: +# apptainer pull "$SP_IMAGE" docker://ghcr.io/cosmostat/shapepipe:develop-runtime +export SP_IMAGE="${SP_IMAGE:-$HOME/shapepipe_develop-runtime.sif}" + +# Host MPI. The image ships OpenMPI 5.0.x, and any host OpenMPI in the 5.0.x +# family is PMIx-compatible with it, so the cluster default is fine. If candide's +# default ever moves to a different major series, pin a 5.0.x here instead +# (`module load openmpi/5.0.x`) to keep the host/container PMIx match. +module load openmpi + +# `mpirun` inherits the node / task layout from the SLURM allocation; -n is the +# total task count. The clone is bind-mounted at the same path so that $SPDIR +# resolves identically inside the container, where the config references it for +# the input and output directories. +mpirun -n "$SLURM_NTASKS" \ + apptainer exec \ + --bind "$SPDIR:$SPDIR" \ + --env SPDIR="$SPDIR" \ + "$SP_IMAGE" \ + shapepipe_run -c "$SPDIR/example/pbs/config_mpi.ini" + +# Propagate the pipeline's exit code to the batch system. +exit $? 
diff --git a/example/pbs/candide_smp.sh b/example/pbs/candide_smp.sh index 8ad89c0f0..bb539c4d6 100644 --- a/example/pbs/candide_smp.sh +++ b/example/pbs/candide_smp.sh @@ -1,31 +1,40 @@ #!/bin/bash +# +# SMP (single-node) Apptainer job for candide (SLURM). +# +# SMP mode parallelises with joblib inside a single process across the allocated +# cores -- no host MPI is involved. Use this for single-node runs; use +# candide_mpi.sh to span multiple nodes. +# +# Submit with: sbatch candide_smp.sh -########################## -# SMP Script for CANDIDE # -########################## +#SBATCH --job-name=shapepipe_smp +#SBATCH --partition=comp +#SBATCH --nodes=1 +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=4 +#SBATCH --time=00:05:00 +#SBATCH --output=%x-%j.log +## #SBATCH --mail-type=END,FAIL +## #SBATCH --mail-user=@cea.fr -# Receive email when job finishes or aborts -#PBS -M @cea.fr -#PBS -m ea -# Set a name for the job -#PBS -N shapepipe_smp -# Join output and errors in one file -#PBS -j oe -# Set maximum computing time (e.g. 5min) -#PBS -l walltime=00:05:00 -# Request number of cores -#PBS -l nodes=4 +# Path to the local ShapePipe clone (holds the example configs and data). +export SPDIR="${SPDIR:-$HOME/shapepipe}" -# Full path to environment -export SPENV="$HOME/.conda/envs/shapepipe" -export SPDIR="$HOME/shapepipe" +# Path to the ShapePipe runtime image. Pull it once with: +# apptainer pull "$SP_IMAGE" docker://ghcr.io/cosmostat/shapepipe:develop-runtime +export SP_IMAGE="${SP_IMAGE:-$HOME/shapepipe_develop-runtime.sif}" -# Activate conda environment -module load intelpython/3 -source activate $SPENV +# Run ShapePipe through the container -- no Python environment to activate. The +# clone is bind-mounted at the same path so that $SPDIR resolves identically +# inside the container, where the config references it for input / output +# directories. Keep SMP_BATCH_SIZE in config_smp.ini aligned with +# --cpus-per-task above. 
+apptainer exec \ + --bind "$SPDIR:$SPDIR" \ + --env SPDIR="$SPDIR" \ + "$SP_IMAGE" \ + shapepipe_run -c "$SPDIR/example/pbs/config_smp.ini" -# Run ShapePipe using full paths to executables -$SPENV/bin/shapepipe_run -c $SPDIR/example/pbs/config_smp.ini - -# Return exit code -exit 0 +# Propagate the pipeline's exit code to the batch system. +exit $? diff --git a/example/pbs/config_mpi.ini b/example/pbs/config_mpi.ini index bb2b8f95d..cd41c9ea3 100644 --- a/example/pbs/config_mpi.ini +++ b/example/pbs/config_mpi.ini @@ -2,7 +2,7 @@ ## ShapePipe execution options [EXECUTION] -MODULE = python_example, serial_example, execute_example +MODULE = python_example_runner, serial_example_runner, execute_example_runner MODE = mpi ## ShapePipe file handling options @@ -15,8 +15,8 @@ OUTPUT_DIR = $SPDIR/example/output TIMEOUT = 00:01:35 ## Module options -[PYTHON_EXAMPLE] +[PYTHON_EXAMPLE_RUNNER] MESSAGE = The obtained value is: -[SERIAL_EXAMPLE] +[SERIAL_EXAMPLE_RUNNER] ADD_INPUT_DIR = $SPDIR/example/data/numbers, $SPDIR/example/data/letters diff --git a/src/shapepipe/pipeline/mpi_run.py b/src/shapepipe/pipeline/mpi_run.py index 4aa547a78..3e3684024 100644 --- a/src/shapepipe/pipeline/mpi_run.py +++ b/src/shapepipe/pipeline/mpi_run.py @@ -33,6 +33,7 @@ def split_mpi_jobs(jobs, batch_size): def submit_mpi_jobs( jobs, config, + module_config_sec, timeout, run_dirs, module_runner, @@ -58,6 +59,7 @@ def submit_mpi_jobs( w_log_name, run_dirs, config, + module_config_sec, timeout, module_runner, ) diff --git a/src/shapepipe/run.py b/src/shapepipe/run.py index fe2093a0d..8212fdfb5 100644 --- a/src/shapepipe/run.py +++ b/src/shapepipe/run.py @@ -416,6 +416,7 @@ def run_mpi(pipe, comm): # Get file handler objects run_dirs = jh.filehd.module_run_dirs module_runner = jh.filehd.module_runners[module] + module_config_sec = jh.filehd.get_module_config_sec(module) worker_log = jh.filehd.get_worker_log_name # Define process list process_list = jh.filehd.process_list @@ -423,8 +424,8 @@ def 
run_mpi(pipe, comm): jobs = split_mpi_jobs(process_list, comm.size) del process_list else: - job_type = module_runner = worker_log = timeout = jobs = ( - run_dirs + job_type = module_runner = worker_log = timeout = jobs = run_dirs = ( + module_config_sec ) = None # Broadcast job type to all nodes @@ -436,6 +437,7 @@ def run_mpi(pipe, comm): run_dirs = comm.bcast(run_dirs, root=0) module_runner = comm.bcast(module_runner, root=0) + module_config_sec = comm.bcast(module_config_sec, root=0) worker_log = comm.bcast(worker_log, root=0) timeout = comm.bcast(timeout, root=0) jobs = comm.scatter(jobs, root=0) @@ -445,6 +447,7 @@ def run_mpi(pipe, comm): submit_mpi_jobs( jobs, config, + module_config_sec, timeout, run_dirs, module_runner, @@ -455,7 +458,7 @@ def run_mpi(pipe, comm): ) # Delete broadcast objects - del module_runner, worker_log, timeout, jobs + del module_runner, module_config_sec, worker_log, timeout, jobs # Finish up parallel jobs if master: diff --git a/src/shapepipe/shapepipe_run.py b/src/shapepipe/shapepipe_run.py index 3cc3893d6..ceb98765e 100755 --- a/src/shapepipe/shapepipe_run.py +++ b/src/shapepipe/shapepipe_run.py @@ -15,7 +15,7 @@ def main(args=None): - run(args) + return run(args) if __name__ == "__main__": diff --git a/tests/unit/test_entrypoints.py b/tests/unit/test_entrypoints.py index 22d898d10..8008aa36f 100644 --- a/tests/unit/test_entrypoints.py +++ b/tests/unit/test_entrypoints.py @@ -47,3 +47,18 @@ def test_console_entrypoint_help_runs(entrypoint): assert result.returncode == 0, result.stderr assert "usage:" in result.stdout.lower() + + +@pytest.mark.parametrize("exit_code", [1, None]) +def test_main_propagates_run_exit_code(monkeypatch, exit_code): + """``main`` must forward ``run``'s return value. + + ``run`` returns 1 when it catches an error; if ``main`` drops that, + ``exit(main())`` becomes ``exit(0)`` and every handled failure looks like + success to the batch system. 
+ """ + import shapepipe.shapepipe_run as entry + + monkeypatch.setattr(entry, "run", lambda args=None: exit_code) + + assert entry.main() == exit_code