From 1cef1581c2663593e2c8b6a7fb287ce9865b5a6a Mon Sep 17 00:00:00 2001
From: Cail Daley
Date: Sat, 30 May 2026 22:09:58 +0200
Subject: [PATCH 01/20] Run candide PBS job scripts through the container
 instead of conda

candide_smp.sh and candide_mpi.sh activated a personal conda env
(module load intelpython/3; source activate $HOME/.conda/envs/shapepipe)
and called $SPENV/bin/shapepipe_run. Convert them to run the pipeline
through the published container image, matching the supported workflow
(the container is the source of truth; see docs/source/container.md).

- Drop the conda environment entirely. The pipeline runs via
  `apptainer exec` against the slim runtime image
  (ghcr.io/cosmostat/shapepipe:develop-runtime), pulled once to a SIF
  whose path is overridable via $SP_IMAGE.
- Bind-mount the host clone ($SPDIR) at the same path inside the
  container so the example configs' $SPDIR-relative input/output
  directories resolve identically in- and outside the container.
- MPI uses the standard "hybrid" Apptainer pattern: host mpiexec
  (module load openmpi) launches one container rank per slot, the
  in-image mpi4py/OpenMPI handle communication.
- Fix a stale path: candide_mpi.sh pointed at example/config_mpi.ini,
  which does not exist; the file is example/pbs/config_mpi.ini.
- Propagate the pipeline exit code to the batch system (exit $?)
  instead of always exiting 0.
- Make $SPDIR overridable for testing.

Tested on candide (c03): candide_smp.sh runs the SMP example pipeline
end-to-end through the container with 0 errors. The MPI hybrid launch
needs a real multi-node allocation to verify end-to-end (it hangs on a
login node); the image's MPI stack (mpiexec + mpi4py 4.1.1) and the
shared container invocation are verified via the SMP run.
Co-Authored-By: Claude Opus 4.8 --- example/pbs/candide_mpi.sh | 44 ++++++++++++++++++++++---------------- example/pbs/candide_smp.sh | 26 +++++++++++++--------- 2 files changed, 42 insertions(+), 28 deletions(-) diff --git a/example/pbs/candide_mpi.sh b/example/pbs/candide_mpi.sh index 0abbbb7f4..0c50f88e6 100644 --- a/example/pbs/candide_mpi.sh +++ b/example/pbs/candide_mpi.sh @@ -20,21 +20,29 @@ # Request number of cores (e.g. 2 from 2 different machines) #PBS -l nodes=2:ppn=2 -# Full path to environment -export SPENV="$HOME/.conda/envs/shapepipe" - -# Full path to example config file and input data -export SPDIR="$HOME/shapepipe" - -# Load modules -module load intelpython/3 -module load openmpi/4.0.5 - -# Activate conda environment -source activate $SPENV - -# Run ShapePipe using full paths to executables -$SPENV/bin/mpiexec --map-by node $SPENV/bin/shapepipe_run -c $SPDIR/example/config_mpi.ini - -# Return exit code -exit 0 +# Path to the local ShapePipe clone (holds the example configs and data) +export SPDIR="${SPDIR:-$HOME/shapepipe}" + +# Path to the ShapePipe runtime image. Pull it once with: +# apptainer pull "$SP_IMAGE" docker://ghcr.io/cosmostat/shapepipe:develop-runtime +export SP_IMAGE="${SP_IMAGE:-$HOME/shapepipe_develop-runtime.sif}" + +# Load the host MPI. ShapePipe runs as a standard "hybrid" Apptainer MPI job: +# the host mpiexec launches one container rank per slot and the in-image +# mpi4py / OpenMPI handle the communication. The image ships OpenMPI 4.1.x, so +# load a host OpenMPI in the same family for ABI compatibility. +module load openmpi + +# Run ShapePipe through the container -- no Python environment to activate. The +# clone is bind-mounted at the same path so that $SPDIR resolves identically +# inside the container, where the config references it for the input and output +# directories. 
+mpiexec --map-by node \ + apptainer exec \ + --bind "$SPDIR:$SPDIR" \ + --env SPDIR="$SPDIR" \ + "$SP_IMAGE" \ + shapepipe_run -c "$SPDIR/example/pbs/config_mpi.ini" + +# Propagate the pipeline's exit code to the batch system +exit $? diff --git a/example/pbs/candide_smp.sh b/example/pbs/candide_smp.sh index 8ad89c0f0..ac6240afb 100644 --- a/example/pbs/candide_smp.sh +++ b/example/pbs/candide_smp.sh @@ -16,16 +16,22 @@ # Request number of cores #PBS -l nodes=4 -# Full path to environment -export SPENV="$HOME/.conda/envs/shapepipe" -export SPDIR="$HOME/shapepipe" +# Path to the local ShapePipe clone (holds the example configs and data) +export SPDIR="${SPDIR:-$HOME/shapepipe}" -# Activate conda environment -module load intelpython/3 -source activate $SPENV +# Path to the ShapePipe runtime image. Pull it once with: +# apptainer pull "$SP_IMAGE" docker://ghcr.io/cosmostat/shapepipe:develop-runtime +export SP_IMAGE="${SP_IMAGE:-$HOME/shapepipe_develop-runtime.sif}" -# Run ShapePipe using full paths to executables -$SPENV/bin/shapepipe_run -c $SPDIR/example/pbs/config_smp.ini +# Run ShapePipe through the container -- no Python environment to activate. The +# clone is bind-mounted at the same path so that $SPDIR resolves identically +# inside the container, where the config references it for the input and output +# directories. +apptainer exec \ + --bind "$SPDIR:$SPDIR" \ + --env SPDIR="$SPDIR" \ + "$SP_IMAGE" \ + shapepipe_run -c "$SPDIR/example/pbs/config_smp.ini" -# Return exit code -exit 0 +# Propagate the pipeline's exit code to the batch system +exit $? 
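The two shell idioms the scripts in this patch rely on can be sketched as follows: `${VAR:-default}` to keep a path overridable, and reading `$?` directly after the pipeline command, before any other command can overwrite it. The helper names `resolve_image` and `run_and_report` are hypothetical and exist only for this illustration; the job scripts themselves call `apptainer exec` directly and end with `exit $?`.

```shell
#!/bin/bash
# Illustrative helpers only; neither function exists in the ShapePipe repository.

# Print the image path: $SP_IMAGE if it is set and non-empty, otherwise the
# same default the job scripts use.
resolve_image() {
    echo "${SP_IMAGE:-$HOME/shapepipe_develop-runtime.sif}"
}

# Run a command, report its exit status on stderr, and return that status.
# The status is saved immediately because the echo below resets $? to 0.
run_and_report() {
    "$@"
    local status=$?
    echo "command exited with status ${status}" >&2
    return "${status}"
}

# In a job script the final lines could then read as below. They are commented
# out so that this sketch can be sourced without apptainer or an image:
#   run_and_report apptainer exec --bind "$SPDIR:$SPDIR" "$(resolve_image)" \
#       shapepipe_run -c "$SPDIR/example/pbs/config_smp.ini"
#   exit $?
```

The point of the wrapper is only to show why the status must be captured first: if a logging line sat between the `apptainer exec` call and a bare `exit $?`, the batch system would see the status of the logging command rather than that of the pipeline.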
From 7a87bef5b1ecfe9cc6c05bafbc11628f20c9a6dc Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 12:53:46 +0200 Subject: [PATCH 02/20] felt: close cleanup-rhostats-jobscripts (D1 stale premise, D2 shipped as #737) Co-Authored-By: Claude Opus 4.8 --- .../cleanup-rhostats-jobscripts.md | 47 +++++++++++++------ 1 file changed, 33 insertions(+), 14 deletions(-) diff --git a/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md b/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md index a95ad25d8..4083ae1b8 100644 --- a/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md +++ b/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md @@ -1,23 +1,42 @@ --- name: 'ShapePipe cleanup: remove obsolete rho-stats/stile; modernize candide job scripts' -status: open +status: closed tags: - - shapepipe - - cleanup - - constitution + - shapepipe + - cleanup + - constitution created-at: 2026-05-30T21:45:50.977369486+02:00 +closed-at: 2026-05-31T12:53:30.382233194+02:00 outcome: |- - Two independent cleanups, each delivered as its own PR (NOT merged to develop): (1) remove the obsolete in-shapepipe rho-stats/stile path — Martin confirmed it's superseded by sp_validation/cosmo_val — opened for Martin's review; (2) modernize the candide PBS job scripts to run via the container instead of a personal conda env, tested on candide (this host is c03=candide). The canfar job scripts are explicitly left untouched (can't verify them) and that's noted in the PR. Shuttled to Codex. + Resolved as one shipped PR + one corrected mis-scope. + + D1 (rho-stats removal) was a STALE PREMISE: the rho-stats/stile/treecorr code was + already surgically removed from develop in #715 (merged 2026-04-23). 
What remained + in `mccd_plots_runner.py` / `mccd_plot_utilities.py` is pure meanshapes/ellipticity + plotting — NOT rho-stats — and Martin explicitly asked to keep it on #715 ("Let's + keep meanshapes, this is very useful... can be run on merged star and PSF catalogues"). + PR #736 was opened then CLOSED (not merged): deleting meanshapes would contradict + Martin and risk a catalogue-paper figure path. `stile` was already gone everywhere. + Lesson: verify the premise against current develop before cutting the branch. + + D2 (candide PBS scripts) SHIPPED as PR #737 — OPEN, CI green, mergeable, awaiting + Martin's review. candide_smp.sh / candide_mpi.sh now run via `apptainer exec` against + ghcr.io/cosmostat/shapepipe:develop-runtime (no conda); host-clone bind-mounted at the + same path so $SPDIR-relative configs resolve identically in/out of container; MPI uses + the hybrid host-mpiexec pattern. Tested on c03=candide: SMP runs the example pipeline + end-to-end with 0 errors; MPI hybrid needs a real multi-node allocation to verify e2e. + canfar + ccin2p3 scripts deliberately untouched (different clusters, can't verify here) + and noted in the PR. Also fixed a stale config path and propagated the real exit code. shuttle: - enabled: true - kind: oneshot - host: c03 - project_dir: /automnt/n17data/cdaley/unions/shapepipe - agent: claude-opus - session: - id: 30ae76cc-6d3d-4773-827f-b6505ca7f3e9 + enabled: true + kind: oneshot + host: c03 + project_dir: /automnt/n17data/cdaley/unions/shapepipe agent: claude-opus - dispatched_at: 2026-05-30T19:52:16.666358713Z + session: + id: f1758ecc-bf5f-452c-9f92-6393adebe65e + agent: claude-opus + dispatched_at: 2026-05-31T10:51:28.745315935Z --- ## Desired State @@ -121,4 +140,4 @@ green on each, neither merged, canfar untouched-and-noted. ## Open Questions - Is `random_cat` truly rho-stats-only, or does any non-rho config/use depend on - it? Confirm before deleting it (vs. just `mccd_plots_runner`). 
\ No newline at end of file + it? Confirm before deleting it (vs. just `mccd_plots_runner`). From 4fc948dbe43c41f279fa37c413d2ee2fa8b23b51 Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 12:57:44 +0200 Subject: [PATCH 03/20] Build OpenMPI 5.0.x in the image; SLURM-ify candide job scripts Hybrid Apptainer MPI was broken on candide: the image shipped Debian bookworm's OpenMPI 4.1.4 (PMIx 2.x) while candide's host launcher is now OpenMPI 5.0.x (PMIx 5.x). A PMIx 2 client cannot handshake with a PMIx 5 server, so every rank degraded to a standalone "rank 0 of 1" -- N singletons instead of one N-rank job (the textbook Apptainer symptom). - Dockerfile: drop libopenmpi-dev/openmpi-bin; build OpenMPI 5.0.8 from source with bundled PMIx 5 / PRRTE (--with-pmix=internal etc.) and --disable-dlopen (static MCA -- fixes an internal-openpmix pdl configure failure and is the right posture for a container). The stock mpi4py wheel dlopens libmpi.so.40, which this build provides, so uv.lock is untouched. - example/pbs/candide_{mpi,smp}.sh: candide migrated PBS -> SLURM (qsub is gone), so convert #PBS -> #SBATCH and launch with `mpirun -n $SLURM_NTASKS apptainer exec ... shapepipe_run`. Load the cluster-default `openmpi` (any 5.0.x is PMIx-compatible). - docs + CLAUDE.md: document the hybrid-MPI run pattern and the build-remotely / pull-locally container workflow. Empirically verified on candide: the 4.1.4 image gives 4x "rank 0 of 1"; an OpenMPI 5.0.8 build wires up correctly. See .felt shapepipe/mpi-hybrid. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- CLAUDE.md | 11 ++++++ Dockerfile | 38 +++++++++++++++++++- docs/source/basic_execution.md | 30 +++++++++++++--- example/pbs/candide_mpi.sh | 66 +++++++++++++++++----------------- example/pbs/candide_smp.sh | 41 +++++++++++---------- 5 files changed, 130 insertions(+), 56 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 17b7c90b8..723439ace 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -36,6 +36,17 @@ way to get all of that is the container. sandbox with a host clone of the repo bind-mounted in and `pip install -e` pointed at it, so edits on the host are live inside the container. +**Testing container changes: build remotely, pull locally.** Don't +`apptainer build` images on a cluster — quotas are tight and the build is slow. +The loop for any change to `Dockerfile` / `pyproject.toml` / `uv.lock` is: edit +→ push → let GitHub Actions build and publish to GHCR → `apptainer pull +docker://ghcr.io/cosmostat/shapepipe:[-runtime]` on the cluster → test. +Watch the remote build with `gh run watch` (or `gh run list --branch `). +The only things that run locally are the pull and the test. On a quota-limited +cluster, keep SIFs and Apptainer's scratch off `$HOME`: point +`APPTAINER_TMPDIR` / `APPTAINER_CACHEDIR` at a roomy data partition and pull +SIFs there. + Full detail: `docs/source/installation.md` and `docs/source/container.md`. ## Layout diff --git a/Dockerfile b/Dockerfile index a98507a32..a5b4b55f4 100644 --- a/Dockerfile +++ b/Dockerfile @@ -39,6 +39,9 @@ ENV SHELL=/bin/bash \ # - compilers and dev libs needed to build the heavier wheels (galsim, # mpi4py, python-pysap, fitsio). # - libgl1, proj, fftw at runtime for skyproj/PyQt5/galsim. +# OpenMPI is deliberately NOT installed from Debian here — bookworm ships +# OpenMPI 4.1.4 / PMIx 2.x, which breaks hybrid MPI on modern clusters. It is +# built from source in the next stanza; see there for the full reasoning. 
RUN apt-get update -y --quiet && \ apt-get install -y --no-install-recommends \ build-essential \ @@ -50,12 +53,45 @@ RUN apt-get update -y --quiet && \ libfftw3-dev libfftw3-bin \ libgsl-dev \ libcfitsio-dev \ - libopenmpi-dev openmpi-bin \ libproj-dev proj-bin \ libgl1-mesa-glx \ psfex source-extractor weightwatcher && \ apt-get clean && rm -rf /var/lib/apt/lists/* +# OpenMPI from source — required for hybrid Apptainer MPI on HPC clusters. +# +# On a cluster ShapePipe runs as a standard Apptainer "hybrid" MPI job: the +# host's `mpirun` launches one container rank per slot, and the OpenMPI inside +# the image wires the ranks together through PMIx. That handshake requires the +# container's PMIx to be compatible with the host launcher's. Debian bookworm's +# package is OpenMPI 4.1.4 with PMIx 2.x; modern clusters (e.g. candide) now run +# OpenMPI 5.0.x with PMIx 5.x, and a PMIx 2 client cannot talk to a PMIx 5 +# server — so every rank silently degrades to a standalone "rank 0 of 1" and the +# job runs N independent copies instead of one N-rank job. Building OpenMPI +# 5.0.x here (with its bundled PMIx 5 / PRRTE) matches those hosts; the 5.0.x +# series is mutually PMIx-compatible, so this image works against any host +# openmpi/5.0.x module. The stock mpi4py wheel (from uv.lock) dlopens +# libmpi.so.40, the soname this build provides, so it needs no rebuild. +# +# --disable-dlopen links every MCA component statically into libmpi / libpmix: +# it sidesteps an internal-openpmix configure failure (the `pdl` component wants +# libltdl headers otherwise) and is the right posture for a container anyway — +# no fragile runtime dlopen of plugin .so files across the SIF / bind boundary. 
+ARG OMPI_VERSION=5.0.8 +ARG OMPI_SERIES=v5.0 +RUN cd /tmp && \ + wget -q "https://download.open-mpi.org/release/open-mpi/${OMPI_SERIES}/openmpi-${OMPI_VERSION}.tar.bz2" && \ + tar xjf "openmpi-${OMPI_VERSION}.tar.bz2" && \ + cd "openmpi-${OMPI_VERSION}" && \ + ./configure --prefix=/opt/ompi \ + --with-pmix=internal --with-prrte=internal \ + --with-hwloc=internal --with-libevent=internal \ + --disable-dlopen --disable-sphinx && \ + make -j"$(nproc)" && make install && \ + cd / && rm -rf /tmp/openmpi-* +ENV PATH="/opt/ompi/bin:${PATH}" \ + LD_LIBRARY_PATH="/opt/ompi/lib:${LD_LIBRARY_PATH}" + # uv — fast reproducible Python deps installer. pyproject.toml + uv.lock # are the SSOT; `uv sync --frozen` installs exactly what uv.lock specifies, # so upstream changes only land when we deliberately regenerate the lockfile. diff --git a/docs/source/basic_execution.md b/docs/source/basic_execution.md index 9e7ca63b4..1f17aa598 100644 --- a/docs/source/basic_execution.md +++ b/docs/source/basic_execution.md @@ -37,11 +37,33 @@ shapepipe_run -c ## Running the Pipeline with MPI ShapePipe can also use [mpi4py](https://mpi4py.readthedocs.io/en/stable/) -for managing parallel processes on clusters with multiple nodes. -The `shapepipe_run` script can be run with MPI as follows +to spread work across multiple nodes of a cluster. Set `MODE = mpi` in the +`[EXECUTION]` section of the config and launch with an MPI runner: ```bash -mpiexec -n shapepipe_run +mpiexec -n shapepipe_run -c ``` -where `` is the number of cores to allocate to the run. +where `` is the number of MPI processes to start. + +### Through the container (the supported way on a cluster) + +On a cluster you run ShapePipe from the published image as a standard Apptainer +*hybrid* MPI job: the **host** `mpirun`/`mpiexec` launches one container rank per +slot, and the OpenMPI bundled in the image wires the ranks together. 
+ +```bash +# one-time: pull the runtime image +apptainer pull shapepipe.sif docker://ghcr.io/cosmostat/shapepipe:develop-runtime + +# load a host MPI in the same family as the image's OpenMPI (5.0.x), then launch +module load openmpi +mpirun -n \ + apptainer exec --bind "$PWD:$PWD" shapepipe.sif \ + shapepipe_run -c +``` + +The image ships **OpenMPI 5.0.x** so that its PMIx matches modern cluster +launchers. The host and container MPI must be compatible: if you see *N* copies +of `rank 0 of 1` instead of one *N*-rank job, load a host OpenMPI in the 5.0.x +family. See `example/pbs/candide_mpi.sh` for a complete SLURM batch script. diff --git a/example/pbs/candide_mpi.sh b/example/pbs/candide_mpi.sh index 0c50f88e6..4eeef4a51 100644 --- a/example/pbs/candide_mpi.sh +++ b/example/pbs/candide_mpi.sh @@ -1,48 +1,50 @@ #!/bin/bash - -########################## -# MPI Script for CANDIDE # -########################## - -# Receive email when job finishes or aborts -## #PBS -M @cea.fr -## #PBS -m ea - -# Set a name for the job -#PBS -N shapepipe_mpi - -# Join output and errors in one file -#PBS -j oe - -# Set maximum computing time (e.g. 5min) -#PBS -l walltime=00:05:00 - -# Request number of cores (e.g. 2 from 2 different machines) -#PBS -l nodes=2:ppn=2 - -# Path to the local ShapePipe clone (holds the example configs and data) +# +# Hybrid Apptainer MPI job for candide (SLURM). +# +# ShapePipe runs as a standard Apptainer "hybrid" MPI job: the host `mpirun` +# launches one container rank per SLURM task, and the OpenMPI + mpi4py inside +# the image handle the communication. For the ranks to find one another, the +# container's OpenMPI must speak the same PMIx as the host launcher -- the +# published image ships OpenMPI 5.0.x to match candide's OpenMPI 5.0.x modules. +# (An OpenMPI 4 image silently degrades to N independent "rank 0 of 1" +# processes.) 
+# +# Submit with: sbatch candide_mpi.sh + +#SBATCH --job-name=shapepipe_mpi +#SBATCH --partition=comp +#SBATCH --nodes=2 +#SBATCH --ntasks=4 +#SBATCH --ntasks-per-node=2 +#SBATCH --time=00:05:00 +#SBATCH --output=%x-%j.log +## #SBATCH --mail-type=END,FAIL +## #SBATCH --mail-user=@cea.fr + +# Path to the local ShapePipe clone (holds the example configs and data). export SPDIR="${SPDIR:-$HOME/shapepipe}" # Path to the ShapePipe runtime image. Pull it once with: # apptainer pull "$SP_IMAGE" docker://ghcr.io/cosmostat/shapepipe:develop-runtime export SP_IMAGE="${SP_IMAGE:-$HOME/shapepipe_develop-runtime.sif}" -# Load the host MPI. ShapePipe runs as a standard "hybrid" Apptainer MPI job: -# the host mpiexec launches one container rank per slot and the in-image -# mpi4py / OpenMPI handle the communication. The image ships OpenMPI 4.1.x, so -# load a host OpenMPI in the same family for ABI compatibility. +# Host MPI. The image ships OpenMPI 5.0.x, and any host OpenMPI in the 5.0.x +# family is PMIx-compatible with it, so the cluster default is fine. If candide's +# default ever moves to a different major series, pin a 5.0.x here instead +# (`module load openmpi/5.0.x`) to keep the host/container PMIx match. module load openmpi -# Run ShapePipe through the container -- no Python environment to activate. The -# clone is bind-mounted at the same path so that $SPDIR resolves identically -# inside the container, where the config references it for the input and output -# directories. -mpiexec --map-by node \ +# `mpirun` inherits the node / task layout from the SLURM allocation; -n is the +# total task count. The clone is bind-mounted at the same path so that $SPDIR +# resolves identically inside the container, where the config references it for +# the input and output directories. 
+mpirun -n "$SLURM_NTASKS" \ apptainer exec \ --bind "$SPDIR:$SPDIR" \ --env SPDIR="$SPDIR" \ "$SP_IMAGE" \ shapepipe_run -c "$SPDIR/example/pbs/config_mpi.ini" -# Propagate the pipeline's exit code to the batch system +# Propagate the pipeline's exit code to the batch system. exit $? diff --git a/example/pbs/candide_smp.sh b/example/pbs/candide_smp.sh index ac6240afb..bb539c4d6 100644 --- a/example/pbs/candide_smp.sh +++ b/example/pbs/candide_smp.sh @@ -1,22 +1,24 @@ #!/bin/bash +# +# SMP (single-node) Apptainer job for candide (SLURM). +# +# SMP mode parallelises with joblib inside a single process across the allocated +# cores -- no host MPI is involved. Use this for single-node runs; use +# candide_mpi.sh to span multiple nodes. +# +# Submit with: sbatch candide_smp.sh -########################## -# SMP Script for CANDIDE # -########################## +#SBATCH --job-name=shapepipe_smp +#SBATCH --partition=comp +#SBATCH --nodes=1 +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=4 +#SBATCH --time=00:05:00 +#SBATCH --output=%x-%j.log +## #SBATCH --mail-type=END,FAIL +## #SBATCH --mail-user=@cea.fr -# Receive email when job finishes or aborts -#PBS -M @cea.fr -#PBS -m ea -# Set a name for the job -#PBS -N shapepipe_smp -# Join output and errors in one file -#PBS -j oe -# Set maximum computing time (e.g. 5min) -#PBS -l walltime=00:05:00 -# Request number of cores -#PBS -l nodes=4 - -# Path to the local ShapePipe clone (holds the example configs and data) +# Path to the local ShapePipe clone (holds the example configs and data). export SPDIR="${SPDIR:-$HOME/shapepipe}" # Path to the ShapePipe runtime image. Pull it once with: @@ -25,13 +27,14 @@ export SP_IMAGE="${SP_IMAGE:-$HOME/shapepipe_develop-runtime.sif}" # Run ShapePipe through the container -- no Python environment to activate. The # clone is bind-mounted at the same path so that $SPDIR resolves identically -# inside the container, where the config references it for the input and output -# directories. 
+# inside the container, where the config references it for input / output +# directories. Keep SMP_BATCH_SIZE in config_smp.ini aligned with +# --cpus-per-task above. apptainer exec \ --bind "$SPDIR:$SPDIR" \ --env SPDIR="$SPDIR" \ "$SP_IMAGE" \ shapepipe_run -c "$SPDIR/example/pbs/config_smp.ini" -# Propagate the pipeline's exit code to the batch system +# Propagate the pipeline's exit code to the batch system. exit $? From d31d4d26ca1244bc90c4b05c8ee56cb562c310f4 Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 13:45:41 +0200 Subject: [PATCH 04/20] ci: publish images on every branch push, not just integration branches Tag each pushed branch's image with the branch name so any open PR has a pullable image (apptainer pull ...:-runtime) that can be tested on a real cluster before merge. Same-repo branch pushes always carry a registry-write token, so this is safe; fork PRs still only build+test via the pull_request trigger (they have no token to publish with). Co-Authored-By: Claude Opus 4.8 (1M context) --- .github/workflows/deploy-image.yml | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/.github/workflows/deploy-image.yml b/.github/workflows/deploy-image.yml index b3b0dc54e..8f6c4a47c 100644 --- a/.github/workflows/deploy-image.yml +++ b/.github/workflows/deploy-image.yml @@ -3,16 +3,21 @@ name: Docker image — build, test, publish # Single source of truth for ShapePipe's environment is the Dockerfile # (slim Python + apt system deps + uv-frozen wheels). This workflow builds # that image, runs the test suite *inside it* — so CI tests exactly what -# ships — and publishes to ghcr only on pushes to the integration branches. +# ships — and publishes to ghcr. 
# -# pull_request → build + test, no publish (also works for fork PRs) -# push → build + test + publish (:develop, :latest, …) +# pull_request → build + test, no publish (covers fork PRs, which have no +# registry token) +# push (any branch) → build + test + publish, tagged with the branch name +# (e.g. :develop, :my-feature, and the -runtime variants) +# +# Publishing on every branch push — not just the integration branches — means +# any open PR has a pullable image (`apptainer pull …:-runtime`) that +# can be tested on a real cluster *before* merge. Same-repo branch pushes always +# carry a registry-write token, so this is safe; fork PRs still only build+test. on: push: branches: - - develop - - main - - master + - '**' pull_request: branches: - develop @@ -121,7 +126,8 @@ jobs: docker run --rm -e HYPOTHESIS_PROFILE=ci "$IMAGE" pytest -rX # ---------------------------------------------------------------- - # Publish (push events only — never on pull_request, incl. forks) + # Publish (push events only — never on pull_request, incl. forks). + # Fires on any branch; the image is tagged with the branch name. # ---------------------------------------------------------------- - name: Log in to the Container registry if: github.event_name == 'push' From 6b2b03672897f1b5a73f10ffbf7cb537ede330da Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 13:56:30 +0200 Subject: [PATCH 05/20] felt: scrub personal wayfinding from public shapepipe store Make the public .felt/ team-facing rather than personal collaboration notes: - shapepipe.md root: drop first-person role framing, the 'working agreement with Martin' section, private ~/.claude memory-note pointers, and royal-we voice convention; rewrite as a person-generic gateway (stack division, repo conventions incl. corrected rho-stats/meanshapes boundary, threads). - Delete fabian-coord-bug (body-less personal reminder) and prs-in-flight (personal PR dashboard); rephrase the 3 inbound wikilinks. 
- Neutralize ngmix-update + docker-uv-revert: strip collaborator names and 'mine'/'we agreed' framing, keep the technical why. Co-Authored-By: Claude Opus 4.8 (1M context) --- .felt/docker-uv-revert/docker-uv-revert.md | 15 +-- .felt/fabian-coord-bug/fabian-coord-bug.md | 10 -- .felt/ngmix-update/ngmix-update.md | 4 +- .felt/prs-in-flight/prs-in-flight.md | 76 ----------- .felt/shapepipe.md | 66 ++++------ .../ci-develop-trigger/ci-develop-trigger.md | 2 +- .felt/shapepipe/mpi-hybrid/mpi-hybrid.md | 124 ++++++++++++++++++ .../smoke-test-read-only.md | 3 +- 8 files changed, 163 insertions(+), 137 deletions(-) delete mode 100644 .felt/fabian-coord-bug/fabian-coord-bug.md delete mode 100644 .felt/prs-in-flight/prs-in-flight.md create mode 100644 .felt/shapepipe/mpi-hybrid/mpi-hybrid.md diff --git a/.felt/docker-uv-revert/docker-uv-revert.md b/.felt/docker-uv-revert/docker-uv-revert.md index 1a68f4c39..dd138a97f 100644 --- a/.felt/docker-uv-revert/docker-uv-revert.md +++ b/.felt/docker-uv-revert/docker-uv-revert.md @@ -6,11 +6,11 @@ tags: - docker - infra created-at: 2026-04-27T11:26:45.677512058+02:00 -outcome: 'PR #719 (chore: switch Dockerfile to slim Python + uv lockfile) opened and CI-green on first try (3m31s); ready for Martin''s review. Drops conda double-install, makes pyproject SSOT + uv.lock the pinned manifest, switches WeightWatcher from sed-patched source build to Debian''s pre-patched 1.12+dfsg-3 package, adds binary smoke tests to deploy-image.yml.' +outcome: 'PR #719 (chore: switch Dockerfile to slim Python + uv lockfile) opened and CI-green on first try (3m31s); ready for review. Drops conda double-install, makes pyproject SSOT + uv.lock the pinned manifest, switches WeightWatcher from sed-patched source build to Debian''s pre-patched 1.12+dfsg-3 package, adds binary smoke tests to deploy-image.yml.' decisions: base: label: Base image - rationale: Conda double-install was the actual problem; cleanest resolution is to drop conda entirely. 
Martin's canfar concern is satisfied as long as the slim image works on canfar. + rationale: Conda double-install was the actual problem; cleanest resolution is to drop conda entirely. The canfar deployment concern is satisfied as long as the slim image works on canfar. default: python-slim options: python-slim: @@ -50,7 +50,7 @@ decisions: label: uv + pyproject + uv.lock; uv sync --frozen in Dockerfile modernize: label: Modernize package versions - rationale: 'We determined which versions MUST stay pinned: only ngmix (Axel''s stable_version branch — replacement is tracked separately). Everything else can move to current latest because uv resolved cleanly and CI smoke test still passes (3m42s). If a real pipeline run on canfar surfaces a numpy-2 / pandas-3 break, the fix is a targeted constraint + uv lock, not a wholesale revert.' + rationale: 'We determined which versions MUST stay pinned: only ngmix (pinned to a stable_version fork branch — replacement is tracked separately). Everything else can move to current latest because uv resolved cleanly and CI smoke test still passes (3m42s). If a real pipeline run on canfar surfaces a numpy-2 / pandas-3 break, the fix is a targeted constraint + uv lock, not a wholesale revert.' default: stay-current options: stay-conservative: @@ -58,7 +58,7 @@ decisions: excluded: true excluded_reason: Drift between pyproject signal and lockfile reality; loses the chance to surface numpy-2/pandas-3 incompatibilities at PR time when CI is fast stay-current: - label: Bump pyproject minimums to current major versions (numpy 2, astropy 7, pandas 3, galsim 2.8, mpi4py 4.1, etc.); pin ngmix to Axel's stable_version branch + label: Bump pyproject minimums to current major versions (numpy 2, astropy 7, pandas 3, galsim 2.8, mpi4py 4.1, etc.); pin ngmix to its stable_version fork branch insights: ci-fast: claim: 'First CI run on PR #719 went green in 3m31s. 
uv installed 238 packages in 322ms — everything resolved to prebuilt wheels, no source compilation of galsim/mpi4py/python-pysap/etc. Massive speedup vs. previous build.' @@ -97,11 +97,10 @@ The `--frozen` flag is the discipline mechanism: a stale lockfile cannot ship. ## Followups - Watch CI on #719. The slim-base apt list is conjectural — galsim/mpi4py/python-pysap pull a lot of system deps and we may need to add more (`libatlas-base-dev`, `libblas-dev`, etc). -- If CI needs anything beyond what's in the apt block, that's the surface that benefits from a [[shapepipe/prs-in-flight]] note for next time. -- After this lands, [[shapepipe/prs-in-flight]] PRs #708 and #714 may need a small rebase. -- Optional: separate `Dockerfile.canfar` building on skaha if there's a concrete deployment reason. Currently conjectural — Martin floated it but we agreed slim should work on canfar. +- If CI needs anything beyond what's in the apt block, that's worth noting for next time. +- After this lands, PRs #708 and #714 may need a small rebase. +- Optional: separate `Dockerfile.canfar` building on skaha if there's a concrete deployment reason. Currently conjectural — floated as a possibility, but slim should work on canfar. ## Connections - [[shapepipe]] — root -- [[shapepipe/prs-in-flight]] — touches the testing-scaffold xfail set and the develop-bugs PR diff --git a/.felt/fabian-coord-bug/fabian-coord-bug.md b/.felt/fabian-coord-bug/fabian-coord-bug.md deleted file mode 100644 index 66213d20c..000000000 --- a/.felt/fabian-coord-bug/fabian-coord-bug.md +++ /dev/null @@ -1,10 +0,0 @@ ---- -name: Fabian's coord-propagation bug + image-sim code on github -tags: - - shapepipe - - bug - - collaboration - - future -created-at: 2026-04-27T11:26:52.878118978+02:00 -outcome: 'Fabian: 1-line fix in shapepipe needs porting; first need him to put image-sim code/configs on github so it''s testable. Beg if necessary.' 
---- diff --git a/.felt/ngmix-update/ngmix-update.md b/.felt/ngmix-update/ngmix-update.md index 2df017deb..723871e65 100644 --- a/.felt/ngmix-update/ngmix-update.md +++ b/.felt/ngmix-update/ngmix-update.md @@ -1,9 +1,9 @@ --- -name: ngmix library upgrade + Lucy wrapper sync +name: ngmix library upgrade + wrapper sync tags: - shapepipe - ngmix - future created-at: 2026-04-27T11:26:51.026191639+02:00 -outcome: 'Future: replace Axel''s stable_version fork with upstream ngmix; reconcile with Lucy''s cleaned-up wrapper from her visit' +outcome: 'Replace the pinned ngmix fork (a stable_version branch carrying not-yet-upstreamed fixes) with upstream ngmix once those land; reconcile the wrapper afterward.' --- diff --git a/.felt/prs-in-flight/prs-in-flight.md b/.felt/prs-in-flight/prs-in-flight.md deleted file mode 100644 index ff110eb0e..000000000 --- a/.felt/prs-in-flight/prs-in-flight.md +++ /dev/null @@ -1,76 +0,0 @@ ---- -name: PRs in flight after v2 merge -tags: - - shapepipe - - pr -created-at: 2026-04-27T11:26:49.300097608+02:00 -outcome: 'Post-v2 + post-propagation: infra stream now landed (#718 setuptools, #719 uv-lockfile, #728 dependabot+SHA-pin), supply-chain hygiene done (20 → 0 alerts). Issue #712 empirically verified resolved against current `:develop` (all 11 packages in Martin''s May 18 list import in both read-only and writable sandbox modes); comment posted, awaiting Martin reply before closing. Science PRs still open: #714 develop-bugs (closes #709 + #711 only — #712 closes separately), #708 testing-scaffold (mine); #725 centroid shift (Axel), several older Martin PRs (#704 #703 #699 #660 #650 #636), #670 lbaumo file_io. Next thread: merge #714.' 
-insights: - 714-already-redundant: - claim: 'Surprise from rebasing #714: its Dockerfile commit (cf304f8f, adding astroquery/numba/fitsio + setuptools<81 pin) was *already* redundant on current develop — the v2 merge silently put astroquery/numba/fitsio into pyproject and the v2 Dockerfile installs them via ''pip install -e ".[fitsio]"'' at the end. setuptools<81 went away via #718. So ''rebase to drop the obsolete commit'' wasn''t waiting on #719 — it was already obsolete the moment v2 merged. Worth checking sooner next time before assuming a fix is still load-bearing.' - xfail-mostly-fixable: - claim: 'Most #708 xfails are about to be resolved: canfar_monitor IndentationError (4 xfails) and summary_run -h (1 xfail) are fixed in #714; astroquery/numba/fitsio import xfails (5 modules) resolve in #719 because uv sync installs them from pyproject. Only stile/treecorr corr2 (4 modules) is a separate issue requiring stile removal or upstream patch.' - dependabot-policy: - claim: 'shapepipe now ships `.github/dependabot.yml` (#728) with 14-day cooldown, monthly grouped lockfile PRs, github-actions ecosystem opted in, and SHA-pinned actions across all four workflows. Reasoning lives in the file itself + the #728 PR body. Companion fiber [[shapepipe/sqlitedict-pickle-smell]] tracks the single dismissed alert.' - 712-empirically-resolved: - claim: 'Issue #712 is empirically resolved against current `ghcr.io/cosmostat/shapepipe:develop` (dev target, post-#728). Both the original packages (astroquery, numba, fitsio) and Martin''s May 18 follow-up list (scipy, joblib, importlib_metadata, tqdm, LSSTDESC.Coord, pyyaml, astropy_iers_data, pyerfa) import cleanly in both read-only and writable sandbox modes, as do the three originally-flagged runner modules. Pyproject confirms astroquery/numba/joblib/tqdm are core deps; the rest are transitives of astropy/mccd/modopt/galsim; fitsio is gated in both runtime (`--extra jupyter --extra fitsio`) and dev (`--extra dev`) targets. 
Comment posted; awaiting Martin reply before closing. Likely root cause of the May 18 report: cached/older image.' -decisions: - setuptools-pin: - label: drop setuptools<81 pin - default: merged - options: - merged: - label: 'Already merged as #718 (c9e71df8) — small one-liner, agreed in transcript' ---- - -Snapshot of CosmoStat/shapepipe PR state, maintained as a living index. - -## Open — infra - -(All infra PRs landed. The dependabot stream is resolved; supply-chain -posture set; SHA-pins in place. See [[shapepipe/sqlitedict-pickle-smell]] -for the one open security-fiber.) - -## Open — issues (mine) - -| # | What | Status | -|---|---|---| -| #712 | Dockerfile missing runtime deps | Empirically resolved against current `:develop` ([comment](https://github.com/CosmoStat/shapepipe/issues/712#issuecomment-4562085977)). Both original list (astroquery/numba/fitsio) and Martin's May 18 follow-up (scipy/joblib/importlib_metadata/tqdm/LSSTDESC.Coord/pyyaml/astropy_iers_data/pyerfa) import cleanly in read-only + writable sandbox modes. Awaiting Martin reply before closing. | -| #711 | summary_run -h crashes | Fixed by #714 (auto-closes on merge) | -| #709 | canfar_monitor IndentationError | Fixed by #714 (auto-closes on merge) | - -## Open — mine (science / fixes) - -| # | Branch | What | Status | -|---|---|---|---| -| #731 | `chore/smoke-test-read-only` | smoke-test in read-only mode | Open. Adds `shapepipe_run_example` wrapper; CI now runs the entry-point smoke under `docker --read-only --tmpfs /tmp:rw`. See [[shapepipe/smoke-test-read-only]]. | -| #714 | `fix/develop-bugs` | small develop bugs (#709, #711) | Open. Originally a multi-bug fix; the Dockerfile portion got absorbed into #719. Worth checking what's still load-bearing here vs already-fixed-upstream. | -| #708 | `chore/testing-scaffold` | Tier 0–2 test scaffolding | Open. Some xfails should have flipped to xpass after the v2 + uv-lockfile work; needs a rebase + xfail-list audit. 
| - -## Open — others' PRs awaiting attention - -| # | Author | What | -|---|---|---| -| #725 | aguinot | Fix centroid shift | -| #704 | martinkilbinger | Contributors | -| #703 | martinkilbinger | V1.3.x | -| #699 | martinkilbinger | Coverage mask | -| #670 | lbaumo | file_io handles sextractor header | -| #660 | martinkilbinger | Existing output directory | -| #650 | martinkilbinger | Third-party catalogue for tile objects | -| #636 | martinkilbinger | Rho statistics: flexible training/test split | - -## Recently closed - -- **#728** `chore/dependabot-config` — dependabot.yml + SHA-pin all actions. Merged 2026-05-28. -- **#727, #726, #724, #722, #721, #720** — dependabot security bumps for idna/urllib3/gitpython/mistune/jupyter-server/jupyterlab. All squash-merged 2026-05-28 (see [[shapepipe/dependabot-pr-triage]]). -- **#719** `chore/uv-lockfile` — merged 2026-05-05 (Martin). -- **#718** `chore/drop-setuptools-pin` — merged. -- **v2.0 PR** — merged. Source of the skaha/conda situation that #719 unwound. - -## Connections - -- [[shapepipe]] — root -- [[shapepipe/docker-uv-revert]] — drove #719 -- [[shapepipe/dependabot-pr-triage]] — drove the 6 security-bump merges (closed) -- [[shapepipe/sqlitedict-pickle-smell]] — future-work fiber for the one dismissed alert diff --git a/.felt/shapepipe.md b/.felt/shapepipe.md index 40d321969..044d7a3b1 100644 --- a/.felt/shapepipe.md +++ b/.felt/shapepipe.md @@ -1,50 +1,40 @@ --- -name: ShapePipe maintenance & PRs +name: ShapePipe — project knowledge & active threads tags: - shapepipe - - portolan created-at: 2026-04-27T11:26:38.71538657+02:00 -outcome: 'Root: collaboration with Martin on ShapePipe — PRs, infra, future ngmix and Fabian work' +outcome: 'Root of ShapePipe''s felt store: the stack division, repo conventions, and the why behind in-flight infra/cleanup threads.' --- -ShapePipe is the UNIONS shape-measurement pipeline. 
I'm not the primary -maintainer (that's Martin Kilbinger); my role is collaborator helping -clean up infra, surface bugs, and keep the merge queue moving while -Martin focuses on science threads. +This is the root of ShapePipe's felt store — shared notes on architecture +decisions, conventions, and in-flight work, for the team and AI agents alike. +ShapePipe is the UNIONS galaxy shape-measurement pipeline; `CLAUDE.md` covers the +build / container / CI overview, and the fibers here carry the *why*. Start here, +then follow the links. -## Working agreement with Martin +## Stack division -Surfaced over a 2026-04-27 walking conversation. Captured in -[[shapepipe/prs-in-flight]] and the per-thread fibers below. +ShapePipe **produces** shear catalogues; `sp_validation` / `cosmo_val` +**consume** and validate them; `cs_util` holds code shared across both. A concern +about *validating* catalogues belongs downstream, not in ShapePipe. -- I review and patch his PRs; he reviews mine. Bugs found during review - go to a dedicated PR rather than getting bundled into his feature - branch (per `feedback_separate_infra_prs`). -- v2.0 was merged fast (it was ready). The skaha base it brought in is - the active source of pain → see [[shapepipe/docker-uv-revert]]. -- I file the issues; Claude usually drafts the PRs in my voice. - Disclosure on Claude-only review per - `feedback_claude_only_review_disclosure`. - -## Active threads - -- **[[shapepipe/docker-uv-revert]]** — slim Python + uv lockfile, drop conda. PR #719 (draft). -- **[[shapepipe/prs-in-flight]]** — tracking #708 (testing scaffold), #714 (develop bugs), #719 (this one). - -## Future work +## Conventions specific to this repo -- **[[shapepipe/ngmix-update]]** — replace Axel's stable_version fork - with upstream ngmix; reconcile with Lucy's wrapper. -- **[[shapepipe/fabian-coord-bug]]** — port Fabian's 1-line coord - propagation fix; first need his image-sim code on github. 
+- **Rho-statistics are obsolete inside ShapePipe.** PSF-systematics validation + moved downstream to `sp_validation` / `cosmo_val` (via `shear_psf_leakage`); + the stile/treecorr rho code was removed in #715. But the **meanshapes / + ellipticity focal-plane plots** (`mccd_plots_runner`) are *deliberately kept* — + they are a general PSF/star-catalogue diagnostic, not rho-stats, and feed + catalogue-paper figures. Don't delete that path along with rho-stats; see + [[shapepipe/cleanup-rhostats-jobscripts]] for where the boundary actually sits. +- Run the pipeline through the container; use `python3.12` explicitly inside it. +- **ngmix** is pinned to a fork branch until fixes land upstream — don't bump + that dependency line. [[ngmix-update]] tracks the path back to upstream. -## Conventions specific to this repo +## Active threads -- Container runs through `app` (apptainer wrapper); use `python3.12` - inside the shapepipe container (see `reference_containers`). -- ShapePipe produces; `sp_validation` consumes; `cs_util` is shared (see - `project_stack_division`). -- Rho stats are obsolete here — sp_validation/cosmo_val took over (see - `project_rho_stats_obsolete`). -- Royal "we" in PR/issue voice; specific findings attributed to Claude - by name (see `feedback_writing_voice_on_cails_behalf`). +- **[[shapepipe/ci-green-on-develop]]** / **[[shapepipe/test-suite]]** — a + tiered, in-image test suite and trustworthy CI on `develop`. +- **[[docker-uv-revert]]** — slim Python base + uv lockfile, dropping conda. +- **[[shapepipe/mpi-hybrid]]** — running hybrid MPI through the container on candide. +- **[[ngmix-update]]** — replacing the pinned ngmix fork with upstream. 
diff --git a/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md b/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md index 29ab2f689..629d6c23d 100644 --- a/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md +++ b/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md @@ -64,7 +64,7 @@ just CI. Deserves its own issue; #732 doesn't touch it. ## Knock-on -[[shapepipe/prs-in-flight]]: **#729** (actions group, bumps `setup-miniconda` +**#729** (actions group, bumps `setup-miniconda` v3→v4) hit the layer-1 failure too — confirming the action bump alone doesn't fix the path. #729 must rebase on top of #732 once it merges before it can go green. The smoke-test work in [[shapepipe/smoke-test-read-only]] diff --git a/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md new file mode 100644 index 000000000..bb65e544f --- /dev/null +++ b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md @@ -0,0 +1,124 @@ +--- +name: ShapePipe hybrid MPI through the container on candide +status: active +tags: + - shapepipe + - mpi + - container + - candide +created-at: 2026-05-31T12:22:50.017370879+02:00 +outcome: 'Container shipped OpenMPI 4.1.4/PMIx2 vs candide host OpenMPI 5.0.x/PMIx5 → hybrid MPI gave N rank-0 singletons. Fix on #737 branch: build OpenMPI 5.0.8 from source (--disable-dlopen, bundled PMIx5/PRRTE), drop libopenmpi-dev, keep mpi4py wheel (uv.lock untouched); SLURM-ify candide scripts (#SBATCH, module load openmpi, mpirun -n $SLURM_NTASKS apptainer exec); CI publishes on every branch push for cluster-testable PR images. Committed+pushed; e2e candide test pending CI image publish.' +--- + +## The problem + +The "MPI verification gap" flagged in [[shapepipe/cleanup-rhostats-jobscripts]]: +PR #737's `candide_mpi.sh` uses the correct Apptainer **hybrid** pattern (host +`mpirun` launches one container rank per task) but couldn't be verified, and the +container/host OpenMPI versions had drifted apart. 
+ +Goal: actually run ShapePipe through the container under MPI on candide, end to +end, following [Apptainer's MPI guidance](https://apptainer.org/docs/user/main/mpi.html). + +## What the data said + +Empirical test on candide (image = `ghcr.io/cosmostat/shapepipe:develop-runtime`, +host `module load openmpi/5.0.8`, single node, 4 ranks): + +``` +mpirun -n 4 apptainer exec $SIF python -m mpi4py.bench helloworld + → Hello, World! I am process 0 of 1 on n23. (×4) +``` + +Four singletons instead of one 4-rank job. Apptainer's docs name this exactly: +*"If your containers run N rank 0 processes … the MPI stack used to launch is not +compatible with the MPI stack in the container."* + +**Root cause — PMIx wire mismatch.** The hybrid model needs the container's MPI +to speak the same PMIx as the host launcher. + +| | OpenMPI | PMIx | +|---|---|---| +| container (Debian bookworm `libopenmpi-dev`) | 4.1.4 | 2.x (`MCA pmix: ext3x`, `--with-pmix=.../pmix2`) | +| candide host (`openmpi/5.0.8`) | 5.0.8 | 5.x (internal) | + +PMIx 2 client cannot connect to the PMIx 5 server PRRTE stands up, so each rank +initializes standalone. (`libmpi.so.40` is ABI-stable across OpenMPI 4↔5, which +is why mpi4py *imports* fine — but import isn't wire-up.) + +## The fix + +Build **OpenMPI 5.0.x from source** in the image (bundled PMIx 5 / PRRTE, +`--with-pmix=internal --with-prrte=internal --with-hwloc=internal +--with-libevent=internal --disable-dlopen`). The stock mpi4py wheel (from +uv.lock) dlopens `libmpi.so.40`, the soname this build provides, so it needs +**no rebuild** and `uv.lock` stays a pure SSOT. `--disable-dlopen` links MCA +components statically — it both fixes an internal-openpmix `pdl` configure +failure (wants libltdl headers otherwise) and is the right posture for a +container (no dlopen of plugin .so across the SIF/bind boundary). 
+ +Proven locally on candide before committing: a minimal proof container compiled +OpenMPI 5.0.8 + built mpi4py clean, and the `--disable-dlopen` flag was found by +iterating the configure step. Then switched to the **build-remotely / pull- +locally** loop (now in CLAUDE.md): edit Dockerfile → push → CI builds and +publishes to GHCR → `apptainer pull` on the cluster → test. Local `apptainer +build` is the wrong default — cluster quotas are tight (hit `disk quota +exceeded` on `$HOME`; keep SIFs + `APPTAINER_TMPDIR`/`CACHEDIR` on a data +partition). CI now publishes on every branch push (not just integration +branches) so any PR has a pullable, cluster-testable image before merge. + +## Keeping host ↔ container MPI in sync (design) + +The container seals off the host's userspace *except* MPI — to use the +interconnect + launcher you need the in-image MPI to cooperate with host +machinery you can't seal off. The contract is narrower than "same version": +what must match is the **PMIx wire protocol** and **launch mechanism**, and +PMIx is compatible *within a major version*. So the compatibility unit is the +**5.0.x series**, not the point release — hence `module load openmpi` (default) +in the job script and `OMPI_VERSION` as a Docker `ARG` (retarget = one number). + +Spectrum for multi-cluster / differing-MPI futures, cheapest → most robust: +1. **Pin a series + track targets** (chosen). One image covers every PMIx-5 + cluster. Most modern HPC is here now. +2. **CI matrix → variants** from the same build-arg (`:…-ompi5`, `:…-ompi4`) + when two targets straddle a PMIx major. One source, N artifacts. +3. **Bind model** (`--bind $MPI_DIR`): no MPI baked, host MPI mounted in — + always matches but fragile (glibc/path/admin-bind caveats). Fallback. +4. **Wi4MPI** (a CEA tool): MPI translation layer, write-once-run-anywhere + across MPI families. Heaviest; the escalation if 1–2 don't suffice. +5. 
**Preflight self-check** (complements any): run a 2-rank helloworld, detect + the "rank 0 of 1" singleton signature, fail loudly instead of silently + running N independent copies → wrong science. Recommended regardless; turns + silent desync into an obvious error. Not yet implemented — candidate for + this PR or a follow-up. + +## Environment facts (candide, 2026-05) + +- **Scheduler is SLURM**, not PBS — `qsub`/`qstat` are gone; partitions `comp` + (2-day) / `compl` (5-day), idle nodes available. The `#PBS` directives in the + candide job scripts are dead. +- **Host OpenMPI**: modules `openmpi/5.0.3`–`5.0.10`, built `-slurm-CentOS8` + (`/softs/openmpi/5.0.8-slurm-CentOS8`). The 4.0.5 the old script loaded is gone. +- **srun launch is not viable** for OpenMPI 5 here: `srun --mpi=list` → + none/cray_shasta/pmi2 only (no pmix). Use `mpirun` (PRRTE carries PMIx). +- **Local container builds work** via `apptainer build --fakeroot` even without + `/etc/subuid` entries (root-mapped namespace; `allow setuid = yes`). + +## Deliverables (on #737 branch `cleanup/candide-scripts-container`) + +All committed (`4fc948db` MPI fix, `d31d4d26` CI), pushed, CI building. Going +onto the existing #737 PR rather than a new one — this completes the candide- +scripts work #737 started. + +1. **Dockerfile** → OpenMPI 5.0.8 from source, `--disable-dlopen`; libopenmpi-dev + dropped; mpi4py wheel kept (uv.lock untouched). +2. **candide job scripts** → SLURM (`#SBATCH`), `module load openmpi` (default), + `mpirun -n $SLURM_NTASKS apptainer exec … shapepipe_run`. + (`example/pbs/config_mpi.ini` already existed and is correct.) +3. **docs / CLAUDE.md** — hybrid-MPI run pattern; build-remotely/pull-locally loop. +4. **CI** — publish on every branch push so PR images are cluster-testable. 
+ +**Still open:** end-to-end hybrid test on candide once CI publishes the +`:cleanup-candide-scripts-container-runtime` image — pull it, run the example +pipeline under `mpirun -n 4 apptainer exec`, confirm distinct ranks (not the +singleton signature) and 0 errors. That's the empirical close on the whole fix. diff --git a/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md b/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md index cba960b3c..b9bbe8849 100644 --- a/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md +++ b/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md @@ -67,5 +67,4 @@ both the runtime and dev target blocks. Sits in the same family as [[shapepipe/docker-multistage]] (which introduced the runtime/dev split) and [[shapepipe/docker-uv-revert]] -(which moved uv writable targets to `/tmp` via env vars). [[shapepipe/prs-in-flight]] -gets a new "in-flight" entry once the PR is up. +(which moved uv writable targets to `/tmp` via env vars). From e5999733327e621d9f24314bd0a9ccb4987f89b5 Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 17:00:13 +0200 Subject: [PATCH 06/20] Fix MPI path: thread module_config_sec through to WorkerHandler.worker The MPI execution path was broken since #415 ("Multiple Module Runs"): WorkerHandler.worker() gained a `module_config_sec` parameter, but `submit_mpi_jobs` in mpi_run.py was never updated to pass it. So the MPI path called worker() with 7 args where 8 are required, failing every run with: WorkerHandler.worker() missing 1 required positional argument: 'module_runner' This stayed invisible for 16 months because MPI is a legacy execution mode (SMP is the production path), and on candide MPI couldn't even wire up due to a PMIx version mismatch -- which masked the code bug beneath. Fixing the launcher (OpenMPI 5.0.x in the image) exposed it. 
Thread `module_config_sec` from run_mpi (root rank, broadcast to all ranks) into submit_mpi_jobs and on to worker(), matching the SMP/serial call sites. Verified end-to-end on candide: 2-node / 4-rank hybrid MPI run of the example pipeline, all three modules complete, 0 errors recorded. Co-Authored-By: Claude Opus 4.8 (1M context) --- src/shapepipe/pipeline/mpi_run.py | 2 ++ src/shapepipe/run.py | 9 ++++++--- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/src/shapepipe/pipeline/mpi_run.py b/src/shapepipe/pipeline/mpi_run.py index 4aa547a78..3e3684024 100644 --- a/src/shapepipe/pipeline/mpi_run.py +++ b/src/shapepipe/pipeline/mpi_run.py @@ -33,6 +33,7 @@ def split_mpi_jobs(jobs, batch_size): def submit_mpi_jobs( jobs, config, + module_config_sec, timeout, run_dirs, module_runner, @@ -58,6 +59,7 @@ def submit_mpi_jobs( w_log_name, run_dirs, config, + module_config_sec, timeout, module_runner, ) diff --git a/src/shapepipe/run.py b/src/shapepipe/run.py index fe2093a0d..8212fdfb5 100644 --- a/src/shapepipe/run.py +++ b/src/shapepipe/run.py @@ -416,6 +416,7 @@ def run_mpi(pipe, comm): # Get file handler objects run_dirs = jh.filehd.module_run_dirs module_runner = jh.filehd.module_runners[module] + module_config_sec = jh.filehd.get_module_config_sec(module) worker_log = jh.filehd.get_worker_log_name # Define process list process_list = jh.filehd.process_list @@ -423,8 +424,8 @@ def run_mpi(pipe, comm): jobs = split_mpi_jobs(process_list, comm.size) del process_list else: - job_type = module_runner = worker_log = timeout = jobs = ( - run_dirs + job_type = module_runner = worker_log = timeout = jobs = run_dirs = ( + module_config_sec ) = None # Broadcast job type to all nodes @@ -436,6 +437,7 @@ def run_mpi(pipe, comm): run_dirs = comm.bcast(run_dirs, root=0) module_runner = comm.bcast(module_runner, root=0) + module_config_sec = comm.bcast(module_config_sec, root=0) worker_log = comm.bcast(worker_log, root=0) timeout = comm.bcast(timeout, root=0) jobs = 
comm.scatter(jobs, root=0) @@ -445,6 +447,7 @@ def run_mpi(pipe, comm): submit_mpi_jobs( jobs, config, + module_config_sec, timeout, run_dirs, module_runner, @@ -455,7 +458,7 @@ def run_mpi(pipe, comm): ) # Delete broadcast objects - del module_runner, worker_log, timeout, jobs + del module_runner, module_config_sec, worker_log, timeout, jobs # Finish up parallel jobs if master: From bf9f1e2c2970ee176b7dc794c928bfb01a10f9ba Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 17:03:23 +0200 Subject: [PATCH 07/20] chore: gitignore felt index WAL sidecars (index.db-shm/-wal) Co-Authored-By: Claude Opus 4.8 (1M context) --- .gitignore | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.gitignore b/.gitignore index 756386e12..3097dc78a 100644 --- a/.gitignore +++ b/.gitignore @@ -140,3 +140,5 @@ code .felt/index.db .felt/index-sync.lock .felt/index-sync.request +.felt/index.db-shm +.felt/index.db-wal From a03baf323169a11019f7334ca5eaa0bff709d64f Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 17:03:24 +0200 Subject: [PATCH 08/20] felt: correct mpi-hybrid close (two-layer bug); add exec-modes-schedulers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The earlier mpi-hybrid close claimed the full pipeline ran clean under MPI. It did not — that run hit a latent ShapePipe code bug and the sbatch RUN_EXIT=0 was a hardcoded echo. Rewrite the empirical close to the true two-layer story: launcher (PMIx) fixed and verified, which then exposed the module_config_sec bug (#415), now fixed in e5999733 and re-verified e2e. Reopen status (fix not yet in the published image, #737 not merged). Add exec-modes-schedulers: a reference fiber mapping smp/mpi (execution modes) and PBS/SLURM (schedulers) — what's production (SMP+SLURM) vs legacy (MPI, PBS) — the context that explains why this bug survived 16 months. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- .../exec-modes-schedulers.md | 65 ++++++++++++++++ .felt/shapepipe/mpi-hybrid/mpi-hybrid.md | 76 +++++++++++++++++-- 2 files changed, 136 insertions(+), 5 deletions(-) create mode 100644 .felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md diff --git a/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md b/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md new file mode 100644 index 000000000..b48d57b18 --- /dev/null +++ b/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md @@ -0,0 +1,65 @@ +--- +name: 'ShapePipe execution modes (smp/mpi) and schedulers (PBS/SLURM): what''s used vs legacy' +tags: + - shapepipe + - mpi + - reference +created-at: 2026-05-31T16:51:46.221097637+02:00 +outcome: 'SMP is the production workhorse (55/56 example configs; all canfar/candide scripts via N_SMP, SLURM+conda); MPI is 2019 legacy used by 1 config and broken since #415. PBS is dead (2019 example scripts only); SLURM is current everywhere. The MPI module_config_sec bug survived 16mo because nobody runs MPI.' +--- + +Two orthogonal axes that are easy to conflate when reasoning about how ShapePipe +runs on a cluster. This fiber pins down what each is, when it entered, and what's +actually used today vs. legacy — the context for [[shapepipe/mpi-hybrid]]. + +## Axis 1 — execution mode (`[EXECUTION] MODE`, inside ShapePipe) + +Dispatched in `src/shapepipe/run.py`: `mode = config["EXECUTION"]["MODE"].lower()`, +then `run_mpi(pipe, comm)` if `mode == "mpi"` else `run_smp(pipe)`. If mpi4py isn't +importable, mode is forced to `smp`. + +- **`smp`** — joblib `Parallel(n_jobs=batch_size)` across cores on **one node** + (`job_handler._distribute_smp_jobs`). **The living path.** 55 of 56 example + configs set `MODE = SMP`; every canfar/candide production script drives it by + injecting `N_SMP` into the config (`SMP_BATCH_SIZE`). 
+- **`mpi`** — mpi4py scatter/gather across **multiple nodes** (`pipeline/mpi_run.py`, + `submit_mpi_jobs`). 2019-era (`c6554983` "initial mpi framework"). Exactly **1** + example config uses it. **Broken since PR #415 (Jan 2025)**: `worker()` gained a + `module_config_sec` param and `mpi_run.py` was never updated, so it passed 7 args + where 8 are required. Invisible for 16 months because nobody runs MPI — and on + candide it couldn't even wire up (PMIx mismatch, see [[shapepipe/mpi-hybrid]]), + which masked the code bug underneath. + +Note `MODE` is overloaded across config sections — `CLASSIC`, `MULTI-EPOCH`, +`FIT_VALIDATION`, `VALIDATION` are *module* modes (PSF / ngmix), not `[EXECUTION]` +modes. Only `smp`/`mpi` live under `[EXECUTION]`. + +## Axis 2 — scheduler (the batch wrapper, outside ShapePipe) + +- **PBS** (`#PBS` / `qsub`) — the 2019 `example/pbs/` scripts. **Dead** on candide + (migrated to SLURM). All `#PBS` directives removed on the #737 branch. +- **SLURM** (`#SBATCH` / `sbatch`) — **current everywhere**. canfar since ~2020, + candide since 2024. + +## The story the dates tell + +ShapePipe shifted from **"a few big MPI jobs under PBS on candide" (2019)** to +**"many small SMP jobs under SLURM" (2024+)**. Today's production submission path +is `scripts/sh/run_scratch_local.sh` (2024-11, *"submit jobs on candide"*) → +`init_run_exclusive_canfar.sh` → `job_sp_canfar.bash`: all `sbatch` (SLURM), all +**SMP** mode via `N_SMP`, and still **conda** (`CONDA_PREFIX=$HOME/.conda/envs/shapepipe`), +*not* the container. + +The `example/pbs/candide_{smp,mpi}.sh` scripts are 2019 **teaching examples** +(untouched until #737 branch), not the production path. + +## Implications + +- The MPI bug fix is worth landing — `mpi` is still a supported mode and fixing it + on candide was the goal — but it restores a *legacy* path, it doesn't unblock + production. 
+- Production canfar/candide scripts (SMP + SLURM + conda) are untouched by #737 and + out of scope; they're also **not yet containerized** — a future gap to name. +- **Open question for Martin / the team:** does anyone still need MPI multi-node + runs at all, or has SMP-under-SLURM (many per-node jobs) fully replaced it? If MPI + is truly dead, the honest move might be to retire it rather than maintain it. diff --git a/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md index bb65e544f..c6f276103 100644 --- a/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md +++ b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md @@ -7,7 +7,19 @@ tags: - container - candide created-at: 2026-05-31T12:22:50.017370879+02:00 -outcome: 'Container shipped OpenMPI 4.1.4/PMIx2 vs candide host OpenMPI 5.0.x/PMIx5 → hybrid MPI gave N rank-0 singletons. Fix on #737 branch: build OpenMPI 5.0.8 from source (--disable-dlopen, bundled PMIx5/PRRTE), drop libopenmpi-dev, keep mpi4py wheel (uv.lock untouched); SLURM-ify candide scripts (#SBATCH, module load openmpi, mpirun -n $SLURM_NTASKS apptainer exec); CI publishes on every branch push for cluster-testable PR images. Committed+pushed; e2e candide test pending CI image publish.' +outcome: |- + Two independent bugs, both fixed, verified e2e on candide. (1) LAUNCHER: container + shipped OpenMPI 4.1.4/PMIx2 vs candide host 5.0.x/PMIx5 → hybrid MPI gave N rank-0 + singletons. Fixed by building OpenMPI 5.0.8 from source in the image (--disable-dlopen, + bundled PMIx5/PRRTE), dropping libopenmpi-dev, keeping the mpi4py wheel (uv.lock + untouched); SLURM-ified candide scripts; CI now publishes on every branch push. + (2) SHAPEPIPE CODE: with ranks finally wired up, shapepipe_run under MPI hit + "worker() missing module_runner" — a latent bug since #415 (mpi_run.py never updated + when worker() gained module_config_sec), invisible for 16mo because MPI is the legacy + path (SMP is production). Fixed in e5999733. 
Re-verified (job 780655, host-src bind + over PR image): 4 ranks n23+n25, all 3 modules ran, real RUN_EXIT=0, 0 errors. + REMAINING: rebuild published image with the code fix (push→CI), then Martin review + + merge of #737. --- ## The problem @@ -117,8 +129,62 @@ scripts work #737 started. (`example/pbs/config_mpi.ini` already existed and is correct.) 3. **docs / CLAUDE.md** — hybrid-MPI run pattern; build-remotely/pull-locally loop. 4. **CI** — publish on every branch push so PR images are cluster-testable. +5. **ShapePipe MPI code fix** (`e5999733`) — thread `module_config_sec` through + `run_mpi`/`submit_mpi_jobs`/`worker()`; the latent #415 bug surfaced once the + launcher worked. Needs an image rebuild to ship. -**Still open:** end-to-end hybrid test on candide once CI publishes the -`:cleanup-candide-scripts-container-runtime` image — pull it, run the example -pipeline under `mpirun -n 4 apptainer exec`, confirm distinct ranks (not the -singleton signature) and 0 errors. That's the empirical close on the whole fix. +## Empirical close (2026-05-31) — two layers + +The fix turned out to have **two independent layers**. The launcher fix +(above) was necessary but not sufficient: making the ranks actually wire up +exposed a second, latent bug in ShapePipe's own MPI code. + +**Layer 1 — launcher (PMIx), verified.** Pulled the PR image on candide and +ran the rank wire-up check (2 nodes, 4 tasks, `module load openmpi` → `mpirun +-n 4 apptainer exec … python -m mpi4py.bench helloworld`): + +``` +Hello, World! I am process 0 of 4 on n23. +Hello, World! I am process 1 of 4 on n23. +Hello, World! I am process 2 of 4 on n25. +Hello, World! I am process 3 of 4 on n25. +``` + +One 4-rank job spanning two nodes — the exact inverse of the pre-fix 4× +"rank 0 of 1". Image reports `Open MPI: 5.0.8`. 
✓ + +**Layer 2 — ShapePipe MPI code, was broken, now fixed.** With the ranks wired +up, the actual `shapepipe_run` under MPI immediately hit: + +``` +ERROR: WorkerHandler.worker() missing 1 required positional argument: 'module_runner' +``` + +A latent bug since PR #415: `worker()` gained a `module_config_sec` parameter +and `pipeline/mpi_run.py:submit_mpi_jobs` was never updated, so it passed 7 +args where 8 are required. Invisible for 16 months because **nobody runs MPI** +— SMP is the production path (see [[shapepipe/exec-modes-schedulers]]) and the +PMIx mismatch meant MPI never even started on candide. Fixed by threading +`module_config_sec` through `run_mpi` → `submit_mpi_jobs` → `worker()` (commit +`e5999733`), matching the SMP/serial call sites. + +**Re-verified end to end** (job 780655, PR image with the working-tree `src` +bind-mounted over `/app/src` so the fix is exercised without an image rebuild): +fixed `submit_mpi_jobs` signature live in-container, 4 ranks across n23+n25, +all three modules (`python`/`serial`/`execute_example_runner`) produced output +trees, real `RUN_EXIT=0`, and `shapepipe.log` records *"A total of 0 errors +were recorded."* **Now genuinely verified.** + +> Correction: an earlier close claimed the full pipeline ran clean at this +> point. It did not — that run hit the Layer-2 error and the sbatch script's +> `RUN_EXIT=0` was a hardcoded `echo`, not the real exit code. The launcher +> half was real; the pipeline half was not, until the code fix above. + +**Remaining:** bake the code fix into the published image (push → CI rebuild +of `:cleanup-candide-scripts-container-runtime`), then Martin's review + merge +of #737. + +(Note: the in-image `mpi4py` import looks absent under `bash -lc` because the +login shell resets PATH off the venv — a probe artifact, not real; the actual +`mpirun apptainer exec python -m mpi4py.bench` run resolves it via the image's +default PATH and wires up fine, as the helloworld output shows.) 
From 7e7b7448843c6963999acd547df59d51cddfa1bc Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 17:24:36 +0200 Subject: [PATCH 09/20] Fix stale module names in example/pbs/config_mpi.ini The MPI example config still used the pre-suffix module names (`python_example`, `serial_example`, `execute_example`) and section headers from 2019-2020; the module loader needs the full runner names (`*_runner`), as example/config.ini uses. With the stale names, rank 0 failed with "No module named 'shapepipe.modules.python_example'" and the other ranks deadlocked in the collective until the wall-clock timeout. Third layer of MPI bit-rot beneath the launcher and the module_config_sec fix, same root cause: nobody runs MPI, so its example config rotted too. Verified: the unmodified candide_mpi.sh against the published runtime image now runs the example pipeline end-to-end (4 ranks / 2 nodes, all three modules, 0 errors, exit 0). Co-Authored-By: Claude Opus 4.8 (1M context) --- example/pbs/config_mpi.ini | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/example/pbs/config_mpi.ini b/example/pbs/config_mpi.ini index bb2b8f95d..cd41c9ea3 100644 --- a/example/pbs/config_mpi.ini +++ b/example/pbs/config_mpi.ini @@ -2,7 +2,7 @@ ## ShapePipe execution options [EXECUTION] -MODULE = python_example, serial_example, execute_example +MODULE = python_example_runner, serial_example_runner, execute_example_runner MODE = mpi ## ShapePipe file handling options @@ -15,8 +15,8 @@ OUTPUT_DIR = $SPDIR/example/output TIMEOUT = 00:01:35 ## Module options -[PYTHON_EXAMPLE] +[PYTHON_EXAMPLE_RUNNER] MESSAGE = The obtained value is: -[SERIAL_EXAMPLE] +[SERIAL_EXAMPLE_RUNNER] ADD_INPUT_DIR = $SPDIR/example/data/numbers, $SPDIR/example/data/letters From be0c7248c2de8f1e34e8b94f621443ac5630c851 Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 17:25:36 +0200 Subject: [PATCH 10/20] =?UTF-8?q?felt:=20mpi-hybrid=20=E2=80=94=20record?= 
=?UTF-8?q?=20Layer=203=20(stale=20config)=20+=20final=20e2e=20verificatio?= =?UTF-8?q?n?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.8 (1M context) --- .felt/shapepipe/mpi-hybrid/mpi-hybrid.md | 79 +++++++++++++++--------- 1 file changed, 51 insertions(+), 28 deletions(-) diff --git a/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md index c6f276103..830c65722 100644 --- a/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md +++ b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md @@ -8,18 +8,20 @@ tags: - candide created-at: 2026-05-31T12:22:50.017370879+02:00 outcome: |- - Two independent bugs, both fixed, verified e2e on candide. (1) LAUNCHER: container - shipped OpenMPI 4.1.4/PMIx2 vs candide host 5.0.x/PMIx5 → hybrid MPI gave N rank-0 - singletons. Fixed by building OpenMPI 5.0.8 from source in the image (--disable-dlopen, - bundled PMIx5/PRRTE), dropping libopenmpi-dev, keeping the mpi4py wheel (uv.lock - untouched); SLURM-ified candide scripts; CI now publishes on every branch push. - (2) SHAPEPIPE CODE: with ranks finally wired up, shapepipe_run under MPI hit - "worker() missing module_runner" — a latent bug since #415 (mpi_run.py never updated - when worker() gained module_config_sec), invisible for 16mo because MPI is the legacy - path (SMP is production). Fixed in e5999733. Re-verified (job 780655, host-src bind - over PR image): 4 ranks n23+n25, all 3 modules ran, real RUN_EXIT=0, 0 errors. - REMAINING: rebuild published image with the code fix (push→CI), then Martin review + - merge of #737. + THREE layers of MPI bit-rot, all fixed, verified e2e on candide via the unmodified + candide_mpi.sh against the published image (job 780660: 4 ranks/2 nodes, all 3 modules, + 0 errors, real exit 0). (1) LAUNCHER: container shipped OpenMPI 4.1.4/PMIx2 vs candide + host 5.0.x/PMIx5 → hybrid MPI gave N rank-0 singletons. 
Fixed by building OpenMPI 5.0.8 + from source in the image (--disable-dlopen, bundled PMIx5/PRRTE), dropping libopenmpi-dev, + keeping the mpi4py wheel (uv.lock untouched); SLURM-ified candide scripts; CI publishes on + every branch push. (2) SHAPEPIPE CODE: with ranks wired up, shapepipe_run hit "worker() + missing module_runner" — latent since #415 (mpi_run.py never updated when worker() gained + module_config_sec). Fixed in e5999733. (3) STALE CONFIG: config_mpi.ini used pre-2020 module + names without the _runner suffix → "No module named python_example" + a 5-min deadlock. + Fixed in 7e7b7448. All three hid for years because nobody runs MPI (SMP is production, + [[shapepipe/exec-modes-schedulers]]). Noted: MPI deadlocks on rank-0 failure instead of + failing fast (follow-up). REMAINING: Martin review + merge of #737; open question whether + MPI should be retired rather than maintained. --- ## The problem @@ -131,7 +133,9 @@ scripts work #737 started. 4. **CI** — publish on every branch push so PR images are cluster-testable. 5. **ShapePipe MPI code fix** (`e5999733`) — thread `module_config_sec` through `run_mpi`/`submit_mpi_jobs`/`worker()`; the latent #415 bug surfaced once the - launcher worked. Needs an image rebuild to ship. + launcher worked. Shipped in the published image (CI rebuild). +6. **Stale example config fix** (`7e7b7448`) — `config_mpi.ini` module names + `*_runner`-suffixed to match the loader; surfaced running the real script. ## Empirical close (2026-05-31) — two layers @@ -168,21 +172,40 @@ PMIx mismatch meant MPI never even started on candide. Fixed by threading `module_config_sec` through `run_mpi` → `submit_mpi_jobs` → `worker()` (commit `e5999733`), matching the SMP/serial call sites. 
-**Re-verified end to end** (job 780655, PR image with the working-tree `src` -bind-mounted over `/app/src` so the fix is exercised without an image rebuild): -fixed `submit_mpi_jobs` signature live in-container, 4 ranks across n23+n25, -all three modules (`python`/`serial`/`execute_example_runner`) produced output -trees, real `RUN_EXIT=0`, and `shapepipe.log` records *"A total of 0 errors -were recorded."* **Now genuinely verified.** - -> Correction: an earlier close claimed the full pipeline ran clean at this -> point. It did not — that run hit the Layer-2 error and the sbatch script's -> `RUN_EXIT=0` was a hardcoded `echo`, not the real exit code. The launcher -> half was real; the pipeline half was not, until the code fix above. - -**Remaining:** bake the code fix into the published image (push → CI rebuild -of `:cleanup-candide-scripts-container-runtime`), then Martin's review + merge -of #737. +Verified with a host-src override (job 780655): fixed `submit_mpi_jobs` +signature live in-container, 4 ranks across n23+n25, all three modules +produced output, real `RUN_EXIT=0`, 0 errors. + +**Layer 3 — stale example config, now fixed.** With the code fix baked into +the published image, the *actual* unmodified `candide_mpi.sh` against +`config_mpi.ini` first hit `No module named 'shapepipe.modules.python_example'` +then deadlocked to the 5-min wall clock. `config_mpi.ini` (last touched 2020) +still used the pre-suffix module names (`python_example`, `[PYTHON_EXAMPLE]`); +the loader needs the full runner names (`python_example_runner`, +`[PYTHON_EXAMPLE_RUNNER]`), as `example/config.ini` uses. Updated to match +(commit `7e7b7448`). Same root cause as Layers 1–2: nobody runs MPI, so its +example config rotted too. + +**Note — MPI deadlocks on rank-0 setup failure** instead of failing fast: when +rank 0 errored on the bad module name, the other ranks blocked in a collective +until SLURM killed the job at the wall clock. 
This is exactly the failure mode +the "preflight self-check / fail loudly" item (option 5 in the spectrum above) +guards against — worth a follow-up so a stale config or desync surfaces as an +immediate error, not a silent 5-minute hang. Out of scope for #737. + +**Genuinely verified end to end** (job 780660): the unmodified `candide_mpi.sh` +against the freshly-published `:cleanup-candide-scripts-container-runtime` image +(fix baked in, no override) ran the example pipeline — 4 ranks / 2 nodes, all +three `*_example_runner` modules produced output trees, *"A total of 0 errors +were recorded"*, real exit 0 (the script's `exit $?`). The deliverable script +itself works. + +> Correction: an earlier close claimed the full pipeline ran clean before any +> code fix. It did not — that run hit the Layer-2 error and the sbatch script's +> `RUN_EXIT=0` was a hardcoded `echo`, not the real exit code. The launcher half +> was real; the pipeline half was not, until the fixes above. + +**Remaining:** Martin's review + merge of #737. (Note: the in-image `mpi4py` import looks absent under `bash -lc` because the login shell resets PATH off the venv — a probe artifact, not real; the actual From 0c2103c83c15324daedd0cb2d7bcc782240d9e6b Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 18:17:01 +0200 Subject: [PATCH 11/20] felt: temper MPI claims to observed-vs-inferred (canfar run history unknown) Walk back 'nobody runs MPI / invisible for 16 months' across both fibers. What we observed: MPI needed three fixes to run on candide; the code bug dates to #415 by git history; the canfar/candide tooling is SMP-only. What we cannot see: how MPI was actually used, especially on canfar where most processing ran. State the evidence, not the inference about practice. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- .../exec-modes-schedulers.md | 50 ++++++++++--------- .felt/shapepipe/mpi-hybrid/mpi-hybrid.md | 27 ++++++---- 2 files changed, 42 insertions(+), 35 deletions(-) diff --git a/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md b/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md index b48d57b18..b601bc189 100644 --- a/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md +++ b/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md @@ -1,11 +1,11 @@ --- -name: 'ShapePipe execution modes (smp/mpi) and schedulers (PBS/SLURM): what''s used vs legacy' +name: 'ShapePipe execution modes (smp/mpi) and schedulers (PBS/SLURM): what the repo''s tooling shows' tags: - shapepipe - mpi - reference created-at: 2026-05-31T16:51:46.221097637+02:00 -outcome: 'SMP is the production workhorse (55/56 example configs; all canfar/candide scripts via N_SMP, SLURM+conda); MPI is 2019 legacy used by 1 config and broken since #415. PBS is dead (2019 example scripts only); SLURM is current everywhere. The MPI module_config_sec bug survived 16mo because nobody runs MPI.' +outcome: 'By the repo''s lights SMP is the exercised path (55/56 example configs; every canfar/candide job script is SMP-only via N_SMP, SLURM+conda); MPI is the 2019 mode, set in 1 config, and its code/config drifted out of sync (module_config_sec bug dates to #415 by git history). PBS is dead (2019 example scripts only); SLURM is current everywhere. CAVEAT: this is what the repo shows, not how ShapePipe was actually run — canfar carried most processing and is invisible from here, so MPI usage history is unknown.' --- Two orthogonal axes that are easy to conflate when reasoning about how ShapePipe @@ -24,11 +24,12 @@ importable, mode is forced to `smp`. injecting `N_SMP` into the config (`SMP_BATCH_SIZE`). - **`mpi`** — mpi4py scatter/gather across **multiple nodes** (`pipeline/mpi_run.py`, `submit_mpi_jobs`). 
2019-era (`c6554983` "initial mpi framework"). Exactly **1** - example config uses it. **Broken since PR #415 (Jan 2025)**: `worker()` gained a - `module_config_sec` param and `mpi_run.py` was never updated, so it passed 7 args - where 8 are required. Invisible for 16 months because nobody runs MPI — and on - candide it couldn't even wire up (PMIx mismatch, see [[shapepipe/mpi-hybrid]]), - which masked the code bug underneath. + example config uses it. The `worker()` call in `mpi_run.py` has been out of sync + since PR #415 (Jan 2025) — `worker()` gained a `module_config_sec` param and + `mpi_run.py` wasn't updated, so it passes 7 args where 8 are required. On candide + it couldn't even wire up (PMIx mismatch, see [[shapepipe/mpi-hybrid]]), so the + code bug couldn't surface here. Whether MPI was run elsewhere (canfar especially, + which we can't see) is unknown — what's clear is the repo's tooling is all SMP. Note `MODE` is overloaded across config sections — `CLASSIC`, `MULTI-EPOCH`, `FIT_VALIDATION`, `VALIDATION` are *module* modes (PSF / ngmix), not `[EXECUTION]` @@ -41,25 +42,26 @@ modes. Only `smp`/`mpi` live under `[EXECUTION]`. - **SLURM** (`#SBATCH` / `sbatch`) — **current everywhere**. canfar since ~2020, candide since 2024. -## The story the dates tell +## What the dates and tooling show -ShapePipe shifted from **"a few big MPI jobs under PBS on candide" (2019)** to -**"many small SMP jobs under SLURM" (2024+)**. Today's production submission path -is `scripts/sh/run_scratch_local.sh` (2024-11, *"submit jobs on candide"*) → -`init_run_exclusive_canfar.sh` → `job_sp_canfar.bash`: all `sbatch` (SLURM), all -**SMP** mode via `N_SMP`, and still **conda** (`CONDA_PREFIX=$HOME/.conda/envs/shapepipe`), -*not* the container. 
+The maintained submission tooling is SMP-only and SLURM-based: `scripts/sh/run_scratch_local.sh` +(2024-11, *"submit jobs on candide"*) → `init_run_exclusive_canfar.sh` → `job_sp_canfar.bash`, +all `sbatch`, all **SMP** via `N_SMP` ("SMP mode only" in their help), and still **conda** +(`CONDA_PREFIX=$HOME/.conda/envs/shapepipe`), *not* the container. The `example/pbs/candide_{smp,mpi}.sh` +scripts are 2019 **teaching examples** (untouched until the #737 branch). -The `example/pbs/candide_{smp,mpi}.sh` scripts are 2019 **teaching examples** -(untouched until #737 branch), not the production path. +This is evidence about the tooling, not a claim about run history. It's suggestive — the +SMP tooling is what's been maintained, the MPI mode and its example config drifted untouched — +but most processing ran on canfar, which isn't visible from this repo, so how much MPI was +actually used is a question for the people who ran it, not something the repo can answer. ## Implications -- The MPI bug fix is worth landing — `mpi` is still a supported mode and fixing it - on candide was the goal — but it restores a *legacy* path, it doesn't unblock - production. -- Production canfar/candide scripts (SMP + SLURM + conda) are untouched by #737 and - out of scope; they're also **not yet containerized** — a future gap to name. -- **Open question for Martin / the team:** does anyone still need MPI multi-node - runs at all, or has SMP-under-SLURM (many per-node jobs) fully replaced it? If MPI - is truly dead, the honest move might be to retire it rather than maintain it. +- The MPI fix is worth landing — `mpi` is a supported mode and getting it working through + the container on candide was the point — framed as enablement/verification, not as + unblocking some known-active workload. +- Production scripts (SMP + SLURM + conda) are untouched by #737 and out of scope; they're + also **not yet containerized** — a future gap to name. 
+- **Open question for Martin / the team:** is multi-node MPI still needed, or has + SMP-under-SLURM become how things are run? He'd know the real history; the repo only + shows the tooling. If MPI isn't used, retiring it may beat maintaining it. diff --git a/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md index 830c65722..d5ec5582e 100644 --- a/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md +++ b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md @@ -18,8 +18,9 @@ outcome: |- missing module_runner" — latent since #415 (mpi_run.py never updated when worker() gained module_config_sec). Fixed in e5999733. (3) STALE CONFIG: config_mpi.ini used pre-2020 module names without the _runner suffix → "No module named python_example" + a 5-min deadlock. - Fixed in 7e7b7448. All three hid for years because nobody runs MPI (SMP is production, - [[shapepipe/exec-modes-schedulers]]). Noted: MPI deadlocks on rank-0 failure instead of + Fixed in 7e7b7448. All three drifted undetected because the repo's exercised path is SMP, + not MPI ([[shapepipe/exec-modes-schedulers]]); actual MPI run history (esp. canfar) is + unknown from here. Noted: MPI deadlocks on rank-0 failure instead of failing fast (follow-up). REMAINING: Martin review + merge of #737; open question whether MPI should be retired rather than maintained. --- @@ -164,13 +165,16 @@ up, the actual `shapepipe_run` under MPI immediately hit: ERROR: WorkerHandler.worker() missing 1 required positional argument: 'module_runner' ``` -A latent bug since PR #415: `worker()` gained a `module_config_sec` parameter -and `pipeline/mpi_run.py:submit_mpi_jobs` was never updated, so it passed 7 -args where 8 are required. Invisible for 16 months because **nobody runs MPI** -— SMP is the production path (see [[shapepipe/exec-modes-schedulers]]) and the -PMIx mismatch meant MPI never even started on candide. 
Fixed by threading -`module_config_sec` through `run_mpi` → `submit_mpi_jobs` → `worker()` (commit -`e5999733`), matching the SMP/serial call sites. +By git history this dates to PR #415: `worker()` gained a `module_config_sec` +parameter and `pipeline/mpi_run.py:submit_mpi_jobs` wasn't updated in step, so +it passes 7 args where 8 are required. On candide this path wasn't reachable +until the launcher fix (PMIx never let MPI start here), so it couldn't surface +on this cluster before. How much MPI has actually been exercised elsewhere — +canfar especially, which we can't see from here — is unknown; what we can say +is the repo's tooling points entirely at SMP (see +[[shapepipe/exec-modes-schedulers]]). Fixed by threading `module_config_sec` +through `run_mpi` → `submit_mpi_jobs` → `worker()` (commit `e5999733`), +matching the SMP/serial call sites. Verified with a host-src override (job 780655): fixed `submit_mpi_jobs` signature live in-container, 4 ranks across n23+n25, all three modules @@ -183,8 +187,9 @@ then deadlocked to the 5-min wall clock. `config_mpi.ini` (last touched 2020) still used the pre-suffix module names (`python_example`, `[PYTHON_EXAMPLE]`); the loader needs the full runner names (`python_example_runner`, `[PYTHON_EXAMPLE_RUNNER]`), as `example/config.ini` uses. Updated to match -(commit `7e7b7448`). Same root cause as Layers 1–2: nobody runs MPI, so its -example config rotted too. +(commit `7e7b7448`). Same flavour as Layers 1–2: the MPI path's tooling and +example config drifted out of sync with the rest of the repo, undetected, +because the repo's exercised path is SMP, not MPI. 
**Note — MPI deadlocks on rank-0 setup failure** instead of failing fast: when rank 0 errored on the bad module name, the other ranks blocked in a collective From e82de0e1d67bcb6ed7e880cc5d3cd2fa8d344986 Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 18:31:24 +0200 Subject: [PATCH 12/20] felt: record SMP==MPI same-worker finding; sharpen MPI question to Martin MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit SMP and MPI call the identical worker() with identical args — same computation, two dispatchers (joblib-on-node vs MPI scatter/gather). worker_handler has no MPI; the workload is embarrassingly parallel. So MPI is an ergonomic convenience, not a computational need. Defer to Martin (in #737) whether MPI earns its keep on candide vs just using SMP; don't retire the documented mode unilaterally. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../exec-modes-schedulers.md | 23 ++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md b/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md index b601bc189..2bfc76d5c 100644 --- a/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md +++ b/.felt/shapepipe/exec-modes-schedulers/exec-modes-schedulers.md @@ -31,6 +31,18 @@ importable, mode is forced to `smp`. code bug couldn't surface here. Whether MPI was run elsewhere (canfar especially, which we can't see) is unknown — what's clear is the repo's tooling is all SMP. +**SMP and MPI are the same computation behind two dispatchers.** Both call the +identical `WorkerHandler.worker()` with the identical 8 args (`job_handler._distribute_smp_jobs` +vs `mpi_run.submit_mpi_jobs`). The MPI path's only inter-rank traffic is `bcast` +of setup objects, one `scatter` of the independent job-list, and one `gather` of +result dicts — `worker_handler.py` (the actual work) has zero MPI in it. 
No +`Send`/`Recv`/`Allreduce`/`Barrier` during compute. That's the signature of an +**embarrassingly parallel** workload: MPI provides no computational capability +that SMP-on-a-node-plus-a-scheduler lacks — it's a job-distribution convenience +(one `mpirun` spanning nodes vs. the submission layer fanning out per-node jobs). +This is what grounds the "is MPI worth keeping?" question to Martin — observed +from the comm pattern, not inferred from usage. + Note `MODE` is overloaded across config sections — `CLASSIC`, `MULTI-EPOCH`, `FIT_VALIDATION`, `VALIDATION` are *module* modes (PSF / ngmix), not `[EXECUTION]` modes. Only `smp`/`mpi` live under `[EXECUTION]`. @@ -62,6 +74,11 @@ actually used is a question for the people who ran it, not something the repo ca unblocking some known-active workload. - Production scripts (SMP + SLURM + conda) are untouched by #737 and out of scope; they're also **not yet containerized** — a future gap to name. -- **Open question for Martin / the team:** is multi-node MPI still needed, or has - SMP-under-SLURM become how things are run? He'd know the real history; the repo only - shows the tooling. If MPI isn't used, retiring it may beat maintaining it. +- **Decision deferred to Martin (asked in #737):** is MPI worth getting working / + maintaining on candide at all, or should candide just use SMP (which works through + the container — `candide_smp.sh`)? Given SMP and MPI are the same computation, MPI + earns its keep only as an ergonomic convenience. We do *not* retire it unilaterally — + it's a documented public mode; #737 leaves it in working order and Martin makes the + call. If kept, add a CI smoke so it can't silently rot again; if dropped, removal is + clean and contained (`mpi_run.py`, `run_mpi`, the `import_mpi` branches, `mpi4py`, + `candide_mpi.sh`). 
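The "same computation behind two dispatchers" point above can be sketched without MPI at all. This is a minimal illustration under stated assumptions: `worker`, `dispatch_smp`, `split_jobs` and `dispatch_scatter_gather` are hypothetical names, a thread pool stands in for the joblib-based SMP path, and a plain list split stands in for the MPI `scatter`/`gather`. It is not ShapePipe's code.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(job):
    """Independent unit of work: needs nothing from any other job."""
    return {"job": job, "result": job * job}


def dispatch_smp(jobs, n_workers=2):
    # On-node dispatcher: a pool maps the worker over the job list.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, jobs))


def split_jobs(jobs, n_ranks):
    # Round-robin split of the job list, one batch per simulated rank.
    return [jobs[rank::n_ranks] for rank in range(n_ranks)]


def dispatch_scatter_gather(jobs, n_ranks=2):
    batches = split_jobs(jobs, n_ranks)  # "scatter"
    # Each simulated rank runs the same worker on its own batch,
    # with no communication during compute.
    per_rank = [[worker(job) for job in batch] for batch in batches]
    # "gather": flatten the per-rank result lists.
    return [res for rank_results in per_rank for res in rank_results]


def by_job(results):
    return sorted(results, key=lambda res: res["job"])


if __name__ == "__main__":
    jobs = list(range(6))
    same = by_job(dispatch_smp(jobs)) == by_job(dispatch_scatter_gather(jobs))
    print("dispatchers agree:", same)
```

Because the jobs are independent, the two dispatchers differ only in the order results come back, which is the sense in which the choice between them is about job distribution rather than computation.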
From 33494d7449f44a1d6fc8e02dcad9454e0beed498 Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 18:49:26 +0200 Subject: [PATCH 13/20] Propagate shapepipe_run's exit code (main must return run's value) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit `main()` called `run(args)` but discarded its return value, so `exit(main())` was always `exit(None)` → 0. `run()` returns 1 when it catches an error (`catch_error` + `return 1`), so *every* handled failure has been exiting 0 — invisible to `exit $?` in the job scripts and to any CI/automation. One-word fix: `return run(args)`. Add a regression test that main forwards run's value. Surfaced while end-to-end testing the MPI singleton guard: the guard fired and logged loudly but the job still exited 0 until this fix. Co-Authored-By: Claude Opus 4.8 (1M context) --- src/shapepipe/shapepipe_run.py | 2 +- tests/unit/test_entrypoints.py | 15 +++++++++++++++ 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/src/shapepipe/shapepipe_run.py b/src/shapepipe/shapepipe_run.py index 3cc3893d6..ceb98765e 100755 --- a/src/shapepipe/shapepipe_run.py +++ b/src/shapepipe/shapepipe_run.py @@ -15,7 +15,7 @@ def main(args=None): - run(args) + return run(args) if __name__ == "__main__": diff --git a/tests/unit/test_entrypoints.py b/tests/unit/test_entrypoints.py index 22d898d10..8008aa36f 100644 --- a/tests/unit/test_entrypoints.py +++ b/tests/unit/test_entrypoints.py @@ -47,3 +47,18 @@ def test_console_entrypoint_help_runs(entrypoint): assert result.returncode == 0, result.stderr assert "usage:" in result.stdout.lower() + + +@pytest.mark.parametrize("exit_code", [1, None]) +def test_main_propagates_run_exit_code(monkeypatch, exit_code): + """``main`` must forward ``run``'s return value. + + ``run`` returns 1 when it catches an error; if ``main`` drops that, + ``exit(main())`` becomes ``exit(0)`` and every handled failure looks like + success to the batch system. 
+ """ + import shapepipe.shapepipe_run as entry + + monkeypatch.setattr(entry, "run", lambda args=None: exit_code) + + assert entry.main() == exit_code From 2289e6a7c2325ccaa97f58a60833eb5058865400 Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 18:49:26 +0200 Subject: [PATCH 14/20] Add MPI world-size preflight check: fail loudly on "rank 0 of N singletons" MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When the host MPI launcher and the container's MPI/PMIx stack are incompatible, every process initialises standalone (COMM_WORLD size 1, rank 0). ShapePipe then treats each as master, hands each the full job list, and runs N uncoordinated copies of the pipeline into the same output directory — silently, with exit 0. This is the failure the OpenMPI-5 image fix prevents on candide, but nothing guarded against a future recurrence on another cluster. check_mpi_world() compares the size that actually wired up (COMM_WORLD) against the size the launcher intended (OMPI_COMM_WORLD_SIZE, which is set per process even when the world fails to form) and aborts on a mismatch. Empirically verified on candide: SLURM_NTASKS is NOT reliable for this (reads 1 on remote-node ranks even in a healthy run) — OMPI_COMM_WORLD_SIZE is. Tested both ways on a real allocation: healthy OMPI-5 run passes and completes; OMPI-4 image under the OMPI-5 host launcher fires the check and exits non-zero (together with the exit-code fix). Also catches partial wire-up (N-1 of N). 
Co-Authored-By: Claude Opus 4.8 (1M context) --- src/shapepipe/pipeline/mpi_run.py | 43 +++++++++++++++++++++ src/shapepipe/run.py | 11 +++++- tests/unit/test_mpi_world.py | 62 +++++++++++++++++++++++++++++++ 3 files changed, 115 insertions(+), 1 deletion(-) create mode 100644 tests/unit/test_mpi_world.py diff --git a/src/shapepipe/pipeline/mpi_run.py b/src/shapepipe/pipeline/mpi_run.py index 3e3684024..f03a45e68 100644 --- a/src/shapepipe/pipeline/mpi_run.py +++ b/src/shapepipe/pipeline/mpi_run.py @@ -6,9 +6,52 @@ """ +import os + from shapepipe.pipeline.worker_handler import WorkerHandler +def check_mpi_world(comm): + """Check MPI World. + + Verify that the MPI world formed at the size the launcher requested, and + abort loudly otherwise. + + This guards against the "N rank-0 singletons" failure: when the host MPI + launcher and the container's MPI / PMIx stack are incompatible, each + process initialises standalone (``COMM_WORLD`` size 1, rank 0). ShapePipe + would then treat every process as master, hand each the full job list, and + run that many uncoordinated copies of the pipeline into the same output + directory -- silently, with exit code 0. Comparing the size that actually + wired up against the size the launcher intended (``OMPI_COMM_WORLD_SIZE``, + which is set per process even when the world fails to form) turns that + silent corruption into an immediate, legible error. + + Parameters + ---------- + comm : MPI.Comm + MPI communicator instance (``MPI.COMM_WORLD``) + + Raises + ------ + RuntimeError + if the launcher requested more ranks than actually wired up + + """ + intended = os.environ.get("OMPI_COMM_WORLD_SIZE") + actual = comm.Get_size() + + if intended is not None and int(intended) != actual: + raise RuntimeError( + f"MPI world mismatch: the launcher requested {intended} ranks but " + f"only {actual} wired up (MPI_COMM_WORLD size {actual}). 
This is " + f"the 'rank 0 of N singletons' failure -- the host MPI launcher and " + f"the container's MPI/PMIx stack are incompatible. Aborting rather " + f"than running {intended} uncoordinated copies of the pipeline into " + f"the same output directory." + ) + + def split_mpi_jobs(jobs, batch_size): """Split MPI Jobs. diff --git a/src/shapepipe/run.py b/src/shapepipe/run.py index 8212fdfb5..1716abaf3 100644 --- a/src/shapepipe/run.py +++ b/src/shapepipe/run.py @@ -20,7 +20,11 @@ from shapepipe.pipeline.dependency_handler import DependencyHandler from shapepipe.pipeline.file_handler import FileHandler from shapepipe.pipeline.job_handler import JobHandler -from shapepipe.pipeline.mpi_run import split_mpi_jobs, submit_mpi_jobs +from shapepipe.pipeline.mpi_run import ( + check_mpi_world, + split_mpi_jobs, + submit_mpi_jobs, +) try: from mpi4py import MPI @@ -372,6 +376,11 @@ def run_mpi(pipe, comm): # Assign master node master = comm.rank == 0 + # Fail loudly if the MPI world did not form at the size the launcher + # requested (the "rank 0 of N singletons" launcher/container mismatch), + # rather than silently running redundant copies of the pipeline. + check_mpi_world(comm) + # Get the module to be run modules = pipe.modules if master else None modules = comm.bcast(modules, root=0) diff --git a/tests/unit/test_mpi_world.py b/tests/unit/test_mpi_world.py new file mode 100644 index 000000000..bc2c24e12 --- /dev/null +++ b/tests/unit/test_mpi_world.py @@ -0,0 +1,62 @@ +"""Guard against the silent "rank 0 of N singletons" MPI failure. + +When the host MPI launcher and the container's MPI/PMIx stack are +incompatible, every process initialises standalone (``COMM_WORLD`` size 1), +and ShapePipe would otherwise run N uncoordinated copies of the pipeline into +the same output directory -- silently, with exit code 0. +``check_mpi_world`` turns that into an immediate error by comparing the size +that actually wired up against the launcher's intended ``OMPI_COMM_WORLD_SIZE``. 
+The intended/actual pairs below are the values measured on candide for a +healthy run and for the OpenMPI-4-container / OpenMPI-5-host mismatch. +""" + +import pytest + +from shapepipe.pipeline.mpi_run import check_mpi_world + + +class _FakeComm: + """Minimal stand-in exposing only ``Get_size``.""" + + def __init__(self, size): + + self._size = size + + def Get_size(self): + + return self._size + + +@pytest.mark.parametrize( + "intended, actual", + [ + ("4", 4), # healthy multi-node run + (None, 1), # no launcher env -> legitimate single-rank run + ("1", 1), # explicit single rank + ], + ids=["healthy", "no-launcher-env", "single-rank"], +) +def test_check_mpi_world_passes(monkeypatch, intended, actual): + + if intended is None: + monkeypatch.delenv("OMPI_COMM_WORLD_SIZE", raising=False) + else: + monkeypatch.setenv("OMPI_COMM_WORLD_SIZE", intended) + + check_mpi_world(_FakeComm(actual)) + + +@pytest.mark.parametrize( + "intended, actual", + [ + ("4", 1), # the measured singleton failure + ("4", 3), # partial wire-up + ], + ids=["singletons", "partial-wireup"], +) +def test_check_mpi_world_aborts_on_mismatch(monkeypatch, intended, actual): + + monkeypatch.setenv("OMPI_COMM_WORLD_SIZE", intended) + + with pytest.raises(RuntimeError, match="MPI world mismatch"): + check_mpi_world(_FakeComm(actual)) From 8e00b8bdce2eafcd9cec97e4d806134d44a564c1 Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 18:51:26 +0200 Subject: [PATCH 15/20] felt: record Layer 4 hardening (singleton guard + exit-code fix) The 'warning sign' pass: added check_mpi_world preflight (OMPI_COMM_WORLD_SIZE vs COMM_WORLD size; SLURM_NTASKS proven unreliable) and, found while testing it e2e, fixed main() swallowing run()'s exit code (every caught error had exited 0). Both tested on a real allocation. Distinct remaining gap: mid-setup rank-0 failure still deadlocks the other ranks. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- .felt/shapepipe/mpi-hybrid/mpi-hybrid.md | 48 +++++++++++++++++++----- 1 file changed, 39 insertions(+), 9 deletions(-) diff --git a/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md index d5ec5582e..7152f6137 100644 --- a/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md +++ b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md @@ -20,9 +20,12 @@ outcome: |- names without the _runner suffix → "No module named python_example" + a 5-min deadlock. Fixed in 7e7b7448. All three drifted undetected because the repo's exercised path is SMP, not MPI ([[shapepipe/exec-modes-schedulers]]); actual MPI run history (esp. canfar) is - unknown from here. Noted: MPI deadlocks on rank-0 failure instead of - failing fast (follow-up). REMAINING: Martin review + merge of #737; open question whether - MPI should be retired rather than maintained. + unknown from here. HARDENING PASS added a preflight guard (check_mpi_world, 2289e6a7: aborts + when OMPI_COMM_WORLD_SIZE != COMM_WORLD size — the singleton signature; SLURM_NTASKS is NOT + reliable for this) and, found while testing it, fixed a swallowed exit code (33494d74: main() + now returns run()'s value — every caught error had been exiting 0). Both tested + verified on + a real allocation. STILL OPEN: deadlock when rank 0 fails mid-setup for non-singleton reasons. + REMAINING: Martin review + merge of #737; open question whether MPI should be retired. --- ## The problem @@ -137,6 +140,10 @@ scripts work #737 started. launcher worked. Shipped in the published image (CI rebuild). 6. **Stale example config fix** (`7e7b7448`) — `config_mpi.ini` module names `*_runner`-suffixed to match the loader; surfaced running the real script. +7. **MPI singleton preflight guard** (`2289e6a7`) — `check_mpi_world()` aborts on + `OMPI_COMM_WORLD_SIZE` ≠ `COMM_WORLD` size; unit + real-allocation tested. +8. 
**Exit-code propagation fix** (`33494d74`) — `main()` returns `run()`'s value; + every caught error had been exiting 0. + regression test. ## Empirical close (2026-05-31) — two layers @@ -191,12 +198,35 @@ the loader needs the full runner names (`python_example_runner`, example config drifted out of sync with the rest of the repo, undetected, because the repo's exercised path is SMP, not MPI. -**Note — MPI deadlocks on rank-0 setup failure** instead of failing fast: when -rank 0 errored on the bad module name, the other ranks blocked in a collective -until SLURM killed the job at the wall clock. This is exactly the failure mode -the "preflight self-check / fail loudly" item (option 5 in the spectrum above) -guards against — worth a follow-up so a stale config or desync surfaces as an -immediate error, not a silent 5-minute hang. Out of scope for #737. +## Layer 4 — silent-failure hardening (the "warning sign") + +A deeper pass on the singleton failure (option 5 in the spectrum above) turned +up two more silent-failure paths and fixed both: + +**(a) No preflight guard against the singleton signature.** In the singleton +case every process is master, `split_mpi_jobs(list, 1)` hands each the *full* +job list, and they all run the whole pipeline into the same output dir — N +uncoordinated copies, exit 0, plausible-but-wrong. Added `check_mpi_world()` +(`mpi_run.py`, called at the top of `run_mpi`): compares the size that wired up +(`COMM_WORLD`) against the size the launcher intended (`OMPI_COMM_WORLD_SIZE`) +and aborts on a mismatch. Empirically: **`SLURM_NTASKS` is NOT usable** for this +— it reads `1` on remote-node ranks even in a healthy run — `OMPI_COMM_WORLD_SIZE` +is the reliable signal (it's `4` in both healthy and singleton; only `COMM_WORLD` +differs). Commit `2289e6a7`, unit-tested + verified on a real allocation (healthy +passes; OMPI-4-image-under-OMPI-5-host fires the abort). 
+ +**(b) Swallowed exit code (the bigger one).** Testing (a) end-to-end exposed that +the guard fired and logged loudly but the job *still exited 0*: `main()` in +`shapepipe_run.py` called `run(args)` without returning it, so `exit(main())` was +always `exit(None)` → 0. **Every caught error in ShapePipe — not just MPI — has +been exiting 0**, invisible to `exit $?` and CI. Fixed to `return run(args)` +(`33494d74`) + regression test. With both, the broken case now exits 1. + +**Still open (distinct gap):** when rank 0 fails *mid-setup* for a non-singleton +reason (e.g. the stale-config module error in Layer 3), ranks 1..N block in the +following `bcast`/`scatter` until the wall clock — the guard runs *before* module +loading, so it doesn't cover this. Fixing it needs collective error propagation +(rank 0 signalling failure before the barrier). Left as a follow-up. **Genuinely verified end to end** (job 780660): the unmodified `candide_mpi.sh` against the freshly-published `:cleanup-candide-scripts-container-runtime` image From d83be28c7c46699b178d941fb0b425deb22db983 Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 19:10:52 +0200 Subject: [PATCH 16/20] Remove MPI singleton preflight guard from this PR (defer pending decision) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reverts the check_mpi_world preflight added earlier in this branch. It guards a real but narrow failure (the "rank 0 of N singletons" desync → silent wrong results), which is already designed out on candide by the OpenMPI-5 image match, and adds a runtime check to core run.py — scope creep for what is an example-script modernization PR, especially while MPI's future is an open question for Martin (it's a hard mpi4py dependency used only by two example scripts and one config, by zero production paths). Keeps the exit-code propagation fix (33494d74), which is broad and unrelated. 
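For reference, the bug behind that fix can be reproduced in a few lines. This is an illustrative sketch, not the code in 33494d74: `run` is a stand-in that reports one caught error, and `main_before`, `main_after` and `exit_status` are hypothetical names for the pre-fix and post-fix shapes of the entry point.

```python
import sys


def run(args):
    """Stand-in for the pipeline runner: report one caught error."""
    return 1


def main_before(args=None):
    """Pre-fix shape: run()'s value is computed, then dropped."""
    run(args)


def main_after(args=None):
    """Post-fix shape: run()'s value becomes the process exit status."""
    return run(args)


def exit_status(entry_point):
    """Return the status that sys.exit(entry_point()) hands to the shell."""
    try:
        sys.exit(entry_point())
    except SystemExit as exc:
        # sys.exit(None) counts as success, so the shell sees 0.
        return 0 if exc.code is None else exc.code
```

With these definitions `exit_status(main_before)` is 0 even though `run` reported an error, while `exit_status(main_after)` is 1, which is what `exit $?` in the job scripts needs in order to see the failure.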
The guard's detection recipe (OMPI_COMM_WORLD_SIZE vs COMM_WORLD size; SLURM_NTASKS is unreliable) is preserved in the mpi-hybrid fiber as a ready follow-up if MPI is kept. Co-Authored-By: Claude Opus 4.8 (1M context) --- src/shapepipe/pipeline/mpi_run.py | 43 --------------------- src/shapepipe/run.py | 11 +----- tests/unit/test_mpi_world.py | 62 ------------------------------- 3 files changed, 1 insertion(+), 115 deletions(-) delete mode 100644 tests/unit/test_mpi_world.py diff --git a/src/shapepipe/pipeline/mpi_run.py b/src/shapepipe/pipeline/mpi_run.py index f03a45e68..3e3684024 100644 --- a/src/shapepipe/pipeline/mpi_run.py +++ b/src/shapepipe/pipeline/mpi_run.py @@ -6,52 +6,9 @@ """ -import os - from shapepipe.pipeline.worker_handler import WorkerHandler -def check_mpi_world(comm): - """Check MPI World. - - Verify that the MPI world formed at the size the launcher requested, and - abort loudly otherwise. - - This guards against the "N rank-0 singletons" failure: when the host MPI - launcher and the container's MPI / PMIx stack are incompatible, each - process initialises standalone (``COMM_WORLD`` size 1, rank 0). ShapePipe - would then treat every process as master, hand each the full job list, and - run that many uncoordinated copies of the pipeline into the same output - directory -- silently, with exit code 0. Comparing the size that actually - wired up against the size the launcher intended (``OMPI_COMM_WORLD_SIZE``, - which is set per process even when the world fails to form) turns that - silent corruption into an immediate, legible error. 
- - Parameters - ---------- - comm : MPI.Comm - MPI communicator instance (``MPI.COMM_WORLD``) - - Raises - ------ - RuntimeError - if the launcher requested more ranks than actually wired up - - """ - intended = os.environ.get("OMPI_COMM_WORLD_SIZE") - actual = comm.Get_size() - - if intended is not None and int(intended) != actual: - raise RuntimeError( - f"MPI world mismatch: the launcher requested {intended} ranks but " - f"only {actual} wired up (MPI_COMM_WORLD size {actual}). This is " - f"the 'rank 0 of N singletons' failure -- the host MPI launcher and " - f"the container's MPI/PMIx stack are incompatible. Aborting rather " - f"than running {intended} uncoordinated copies of the pipeline into " - f"the same output directory." - ) - - def split_mpi_jobs(jobs, batch_size): """Split MPI Jobs. diff --git a/src/shapepipe/run.py b/src/shapepipe/run.py index 1716abaf3..8212fdfb5 100644 --- a/src/shapepipe/run.py +++ b/src/shapepipe/run.py @@ -20,11 +20,7 @@ from shapepipe.pipeline.dependency_handler import DependencyHandler from shapepipe.pipeline.file_handler import FileHandler from shapepipe.pipeline.job_handler import JobHandler -from shapepipe.pipeline.mpi_run import ( - check_mpi_world, - split_mpi_jobs, - submit_mpi_jobs, -) +from shapepipe.pipeline.mpi_run import split_mpi_jobs, submit_mpi_jobs try: from mpi4py import MPI @@ -376,11 +372,6 @@ def run_mpi(pipe, comm): # Assign master node master = comm.rank == 0 - # Fail loudly if the MPI world did not form at the size the launcher - # requested (the "rank 0 of N singletons" launcher/container mismatch), - # rather than silently running redundant copies of the pipeline. 
- check_mpi_world(comm) - # Get the module to be run modules = pipe.modules if master else None modules = comm.bcast(modules, root=0) diff --git a/tests/unit/test_mpi_world.py b/tests/unit/test_mpi_world.py deleted file mode 100644 index bc2c24e12..000000000 --- a/tests/unit/test_mpi_world.py +++ /dev/null @@ -1,62 +0,0 @@ -"""Guard against the silent "rank 0 of N singletons" MPI failure. - -When the host MPI launcher and the container's MPI/PMIx stack are -incompatible, every process initialises standalone (``COMM_WORLD`` size 1), -and ShapePipe would otherwise run N uncoordinated copies of the pipeline into -the same output directory -- silently, with exit code 0. -``check_mpi_world`` turns that into an immediate error by comparing the size -that actually wired up against the launcher's intended ``OMPI_COMM_WORLD_SIZE``. -The intended/actual pairs below are the values measured on candide for a -healthy run and for the OpenMPI-4-container / OpenMPI-5-host mismatch. -""" - -import pytest - -from shapepipe.pipeline.mpi_run import check_mpi_world - - -class _FakeComm: - """Minimal stand-in exposing only ``Get_size``.""" - - def __init__(self, size): - - self._size = size - - def Get_size(self): - - return self._size - - -@pytest.mark.parametrize( - "intended, actual", - [ - ("4", 4), # healthy multi-node run - (None, 1), # no launcher env -> legitimate single-rank run - ("1", 1), # explicit single rank - ], - ids=["healthy", "no-launcher-env", "single-rank"], -) -def test_check_mpi_world_passes(monkeypatch, intended, actual): - - if intended is None: - monkeypatch.delenv("OMPI_COMM_WORLD_SIZE", raising=False) - else: - monkeypatch.setenv("OMPI_COMM_WORLD_SIZE", intended) - - check_mpi_world(_FakeComm(actual)) - - -@pytest.mark.parametrize( - "intended, actual", - [ - ("4", 1), # the measured singleton failure - ("4", 3), # partial wire-up - ], - ids=["singletons", "partial-wireup"], -) -def test_check_mpi_world_aborts_on_mismatch(monkeypatch, intended, actual): - - 
monkeypatch.setenv("OMPI_COMM_WORLD_SIZE", intended) - - with pytest.raises(RuntimeError, match="MPI world mismatch"): - check_mpi_world(_FakeComm(actual)) From 8a0bbe55840d289df57f786819ad97a3e30b24f5 Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 19:12:40 +0200 Subject: [PATCH 17/20] felt: park the singleton guard as follow-up; sharpen Martin question Guard pulled from #737 (scope creep on a maybe-retired mode); exit-code fix kept. Recipe preserved in Layer 4. Question to Martin sharpened to 'is MPI a used dependency at all?' with the full footprint: hard mpi4py dep, 2 example scripts, 1 config, 0 production paths. Co-Authored-By: Claude Opus 4.8 (1M context) --- .felt/shapepipe/mpi-hybrid/mpi-hybrid.md | 66 +++++++++++++----------- 1 file changed, 37 insertions(+), 29 deletions(-) diff --git a/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md index 7152f6137..8e436c30d 100644 --- a/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md +++ b/.felt/shapepipe/mpi-hybrid/mpi-hybrid.md @@ -20,12 +20,13 @@ outcome: |- names without the _runner suffix → "No module named python_example" + a 5-min deadlock. Fixed in 7e7b7448. All three drifted undetected because the repo's exercised path is SMP, not MPI ([[shapepipe/exec-modes-schedulers]]); actual MPI run history (esp. canfar) is - unknown from here. HARDENING PASS added a preflight guard (check_mpi_world, 2289e6a7: aborts - when OMPI_COMM_WORLD_SIZE != COMM_WORLD size — the singleton signature; SLURM_NTASKS is NOT - reliable for this) and, found while testing it, fixed a swallowed exit code (33494d74: main() - now returns run()'s value — every caught error had been exiting 0). Both tested + verified on - a real allocation. STILL OPEN: deadlock when rank 0 fails mid-setup for non-singleton reasons. - REMAINING: Martin review + merge of #737; open question whether MPI should be retired. + unknown from here. 
HARDENING PASS: KEPT a swallowed-exit-code fix (33494d74: main() now returns + run()'s value — every caught error had been exiting 0, broad + unrelated to MPI). PROTOTYPED + then PULLED a singleton preflight guard (check_mpi_world: abort when OMPI_COMM_WORLD_SIZE != + COMM_WORLD size — SLURM_NTASKS unreliable) — verified working but removed as scope creep on a + maybe-retired mode; recipe parked in Layer 4. STILL OPEN: rank-0 mid-setup deadlock. REMAINING: + Martin review + merge of #737; sharpened question — is MPI a used dependency at all? (hard + mpi4py dep, 2 example scripts, 1 config, 0 production paths). --- ## The problem @@ -140,11 +141,13 @@ scripts work #737 started. launcher worked. Shipped in the published image (CI rebuild). 6. **Stale example config fix** (`7e7b7448`) — `config_mpi.ini` module names `*_runner`-suffixed to match the loader; surfaced running the real script. -7. **MPI singleton preflight guard** (`2289e6a7`) — `check_mpi_world()` aborts on - `OMPI_COMM_WORLD_SIZE` ≠ `COMM_WORLD` size; unit + real-allocation tested. -8. **Exit-code propagation fix** (`33494d74`) — `main()` returns `run()`'s value; +7. **Exit-code propagation fix** (`33494d74`) — `main()` returns `run()`'s value; every caught error had been exiting 0. + regression test. +Pulled from this PR (parked follow-up, gated on MPI being kept): the +`check_mpi_world()` singleton preflight guard — prototyped + verified, recipe in +Layer 4 above. + ## Empirical close (2026-05-31) — two layers The fix turned out to have **two independent layers**. The launcher fix @@ -201,26 +204,31 @@ because the repo's exercised path is SMP, not MPI. 
## Layer 4 — silent-failure hardening (the "warning sign") A deeper pass on the singleton failure (option 5 in the spectrum above) turned -up two more silent-failure paths and fixed both: - -**(a) No preflight guard against the singleton signature.** In the singleton -case every process is master, `split_mpi_jobs(list, 1)` hands each the *full* -job list, and they all run the whole pipeline into the same output dir — N -uncoordinated copies, exit 0, plausible-but-wrong. Added `check_mpi_world()` -(`mpi_run.py`, called at the top of `run_mpi`): compares the size that wired up -(`COMM_WORLD`) against the size the launcher intended (`OMPI_COMM_WORLD_SIZE`) -and aborts on a mismatch. Empirically: **`SLURM_NTASKS` is NOT usable** for this -— it reads `1` on remote-node ranks even in a healthy run — `OMPI_COMM_WORLD_SIZE` -is the reliable signal (it's `4` in both healthy and singleton; only `COMM_WORLD` -differs). Commit `2289e6a7`, unit-tested + verified on a real allocation (healthy -passes; OMPI-4-image-under-OMPI-5-host fires the abort). - -**(b) Swallowed exit code (the bigger one).** Testing (a) end-to-end exposed that -the guard fired and logged loudly but the job *still exited 0*: `main()` in -`shapepipe_run.py` called `run(args)` without returning it, so `exit(main())` was -always `exit(None)` → 0. **Every caught error in ShapePipe — not just MPI — has -been exiting 0**, invisible to `exit $?` and CI. Fixed to `return run(args)` -(`33494d74`) + regression test. With both, the broken case now exits 1. +up two more silent-failure paths. One was kept; one was prototyped, verified, +then deliberately pulled back out (see below). + +**(a) Swallowed exit code — KEPT (`33494d74`).** `main()` in `shapepipe_run.py` +called `run(args)` without returning it, so `exit(main())` was always +`exit(None)` → 0. **Every caught error in ShapePipe — not just MPI — has been +exiting 0**, invisible to `exit $?` and CI. Fixed to `return run(args)` + +regression test. 
Broad, simple, unrelated to MPI's fate, so it stays. + +**(b) Singleton preflight guard — PROTOTYPED, then PULLED (`2289e6a7` reverted).** +In the singleton case every process is master, `split_mpi_jobs(list, 1)` hands +each the *full* job list, and they all run the whole pipeline into the same +output dir — N uncoordinated copies, exit 0, plausible-but-wrong. The exit-code +fix does **not** catch this: singletons don't raise, they "succeed" wrongly. A +`check_mpi_world()` preflight was written and verified on a real allocation +(healthy passes; OMPI-4-under-OMPI-5-host fires + exits non-zero). It was then +removed from #737 as scope creep: the failure is already designed out on candide +by the OpenMPI-5 match, it adds a runtime check to core `run.py`, and MPI's +future is an open question (a hard `mpi4py` dependency used by only 2 example +scripts — candide + ccin2p3 `cc_mpi.sh` — 1 config, and 0 production paths). +**Recipe, if MPI is kept:** at the top of `run_mpi`, abort when +`int(os.environ["OMPI_COMM_WORLD_SIZE"]) != comm.Get_size()`. The hard-won part +is that signal choice — **`SLURM_NTASKS` is NOT usable** (reads `1` on +remote-node ranks even when healthy); `OMPI_COMM_WORLD_SIZE` is `4` in both +healthy and singleton, only `COMM_WORLD` differs. **Still open (distinct gap):** when rank 0 fails *mid-setup* for a non-singleton reason (e.g. the stale-config module error in Layer 3), ranks 1..N block in the From 52f0e415a7bff0a282cc65f3a8a5c81e0489eedd Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 19:24:49 +0200 Subject: [PATCH 18/20] docs(README): add verified candide container quickstart MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The job scripts assumed the runtime SIF already existed; the pull step and candide's home-quota gotcha (point APPTAINER_CACHEDIR at a data partition or `apptainer pull` dies on $HOME) lived only in CLAUDE.md and felt. 
Add a copy-paste "Quickstart on a cluster (candide)" to the README — the on-ramp a newcomer actually reads — covering clone -> quota-safe pull -> sbatch the example, pointing at example/pbs and docs/source/container.md for depth. Verified end to end on candide (c03, apptainer 1.4.5, SLURM): the exact quickstart command form runs candide_smp.sh against the published :develop-runtime image -> job COMPLETED, ExitCode 0:0, "A total of 0 errors were recorded". Also refresh the stale python-3.9 badge to 3.12 (the shipped interpreter) and drop a stray character from its target URL. Co-Authored-By: Claude Opus 4.8 (1M context) --- README.rst | 43 ++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 40 insertions(+), 3 deletions(-) diff --git a/README.rst b/README.rst index da8fe0701..ea78e6145 100644 --- a/README.rst +++ b/README.rst @@ -1,7 +1,7 @@ ShapePipe ========= -|CI| |CD| |python39| |release| +|CI| |CD| |python312| |release| .. |CI| image:: https://github.com/CosmoStat/shapepipe/workflows/CI/badge.svg :target: https://github.com/CosmoStat/shapepipe/actions?query=workflow%3ACI @@ -9,8 +9,8 @@ ShapePipe .. |CD| image:: https://github.com/CosmoStat/shapepipe/actions/workflows/pages/pages-build-deployment/badge.svg :target: https://github.com/CosmoStat/shapepipe/actions/workflows/pages/pages-build-deployment -.. |python39| image:: https://img.shields.io/badge/python-3.9-green.svg - :target: https://www.python.org/‰ +.. |python312| image:: https://img.shields.io/badge/python-3.12-green.svg + :target: https://www.python.org/ .. |release| image:: https://img.shields.io/github/v/release/CosmoStat/shapepipe :target: https://github.com/CosmoStat/shapepipe/releases/latest @@ -20,3 +20,40 @@ CosmoStat lab at CEA Paris-Saclay. See the `documentation `_ for details on how to install and run ShapePipe. 
+ +Quickstart on a cluster (candide) +--------------------------------- + +ShapePipe ships as a container image — the supported way to run it (see +``docs/source/container.md``). On a SLURM cluster such as candide, pull the slim +``runtime`` image once and submit the bundled example, which runs the pipeline +on a single CFIS tile: + +.. code-block:: bash + + # 0. Get a clone (holds the example configs, data, and job scripts). + git clone https://github.com/CosmoStat/shapepipe.git + cd shapepipe + + # 1. Keep the SIF and Apptainer's scratch off the quota-limited $HOME. + # candide's home quota is tight; a pull there fails with "disk quota + # exceeded". Point both at a roomy data partition instead. + export DATA=/n17data/$USER # adjust to your data partition + export APPTAINER_CACHEDIR=$DATA/.apptainer + + # 2. Pull the runtime image (≈850 MB). + apptainer pull "$DATA/shapepipe-runtime.sif" \ + docker://ghcr.io/cosmostat/shapepipe:develop-runtime + + # 3. Submit the example pipeline (SMP, single node). + SP_IMAGE="$DATA/shapepipe-runtime.sif" SPDIR="$PWD" \ + sbatch example/pbs/candide_smp.sh + +A clean run logs ``A total of 0 errors were recorded`` and exits ``0``. To span +multiple nodes with hybrid MPI, swap in ``example/pbs/candide_mpi.sh`` (same two +variables) — see the comments in each script for the SLURM directives. + +The ``:develop-runtime`` tag tracks the integration branch; for a stable cut use +a release tag (e.g. ``:v1.1.0-runtime``). The interactive ``dev`` image (no +``-runtime`` suffix) carries ``vim``, ``pytest``, and the full toolchain for +working *inside* the container; ``docs/source/container.md`` covers both. 
From 981c8e0d01d97ae176a3f2ff2b676f7c2fc22821 Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 19:33:47 +0200 Subject: [PATCH 19/20] docs: make README a general front door; move candide detail to container.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The README had tunnel-visioned onto a candide-specific SLURM walkthrough — wrong altitude for the project's landing page, where the broader community arrives. Restructure it as a front door: a one-sentence-deeper description of what ShapePipe does, a Quickstart that runs the bundled example straight from the published container in one command (apptainer or docker, no install, no cluster specifics), the image tag scheme, and a Documentation signpost to the published pages. Move the candide cluster walkthrough (quota-safe pull -> sbatch candide_smp.sh, the SPDIR bind-mount, the MPI PMIx note) into a new "Running on a cluster (SLURM)" section in container.md, which the README links to. Drop the test-assertion prose ("logs ... and exits 0") that read like a CI check rather than user docs. Both quickstart commands verified on candide against :develop-runtime (including the no-pre-pull `apptainer exec docker://...` form): the bundled example runs to completion, 0 errors. Co-Authored-By: Claude Opus 4.8 (1M context) --- README.rst | 58 +++++++++++++++++++--------------------- docs/source/container.md | 38 ++++++++++++++++++++++++++ 2 files changed, 65 insertions(+), 31 deletions(-) diff --git a/README.rst b/README.rst index ea78e6145..c92bdf0b6 100644 --- a/README.rst +++ b/README.rst @@ -16,44 +16,40 @@ ShapePipe :target: https://github.com/CosmoStat/shapepipe/releases/latest ShapePipe is a galaxy shape measurement pipeline developed within the -CosmoStat lab at CEA Paris-Saclay. +CosmoStat lab at CEA Paris-Saclay. 
It runs the full chain from raw survey +images to calibrated shear catalogues — object detection, PSF modelling, and +shape measurement — and produced the first UNIONS cosmic-shear release. -See the `documentation `_ for details -on how to install and run ShapePipe. +Quickstart +---------- -Quickstart on a cluster (candide) ---------------------------------- - -ShapePipe ships as a container image — the supported way to run it (see -``docs/source/container.md``). On a SLURM cluster such as candide, pull the slim -``runtime`` image once and submit the bundled example, which runs the pipeline -on a single CFIS tile: +ShapePipe ships as a container image, so you can run the bundled example +pipeline — a single CFIS tile through the full chain — without installing +anything: .. code-block:: bash - # 0. Get a clone (holds the example configs, data, and job scripts). - git clone https://github.com/CosmoStat/shapepipe.git - cd shapepipe + # Apptainer (HPC, no root needed): + apptainer exec docker://ghcr.io/cosmostat/shapepipe:develop-runtime shapepipe_run_example + + # ...or Docker: + docker run --rm ghcr.io/cosmostat/shapepipe:develop-runtime shapepipe_run_example - # 1. Keep the SIF and Apptainer's scratch off the quota-limited $HOME. - # candide's home quota is tight; a pull there fails with "disk quota - # exceeded". Point both at a roomy data partition instead. - export DATA=/n17data/$USER # adjust to your data partition - export APPTAINER_CACHEDIR=$DATA/.apptainer +The image is published on every push to the `GitHub Container Registry +`_: +``:develop`` tracks the integration branch, release tags (e.g. ``:v1.1.0``) a +stable cut, and the ``-runtime`` suffix selects the slim batch image over the +full interactive one. - # 2. Pull the runtime image (≈850 MB). - apptainer pull "$DATA/shapepipe-runtime.sif" \ - docker://ghcr.io/cosmostat/shapepipe:develop-runtime +Documentation +------------- - # 3. Submit the example pipeline (SMP, single node). 
- SP_IMAGE="$DATA/shapepipe-runtime.sif" SPDIR="$PWD" \ - sbatch example/pbs/candide_smp.sh +Full documentation lives at https://cosmostat.github.io/shapepipe. Good places +to start: -A clean run logs ``A total of 0 errors were recorded`` and exits ``0``. To span -multiple nodes with hybrid MPI, swap in ``example/pbs/candide_mpi.sh`` (same two -variables) — see the comments in each script for the SLURM directives. +- `Installation `_ — getting ShapePipe onto your machine or cluster. +- `Basic execution `_ and `configuration `_ — running ``shapepipe_run`` and writing pipeline configs. +- `Container workflow `_ — the two image targets, the ``pyproject.toml`` / ``uv.lock`` / ``Dockerfile`` layers, and how to run on a SLURM cluster (with a worked candide example). -The ``:develop-runtime`` tag tracks the integration branch; for a stable cut use -a release tag (e.g. ``:v1.1.0-runtime``). The interactive ``dev`` image (no -``-runtime`` suffix) carries ``vim``, ``pytest``, and the full toolchain for -working *inside* the container; ``docs/source/container.md`` covers both. +If you use ShapePipe in academic work, please cite Guinot et al. (2022) and +Farrens et al. (2022). diff --git a/docs/source/container.md b/docs/source/container.md index 2511a7f3e..405f09c40 100644 --- a/docs/source/container.md +++ b/docs/source/container.md @@ -91,6 +91,44 @@ in, all because something tries to write under `/app` or `$HOME`: If you bypass `/tmp` (e.g. with a custom apptainer profile) you may need to override these manually. +## Running on a cluster (SLURM) + +On a batch cluster you pull the slim `runtime` image once to a SIF, then +submit a job that runs `shapepipe_run` through it. The repo ships ready +SLURM scripts for the **candide** cluster in `example/pbs/` — +`candide_smp.sh` (single node) and `candide_mpi.sh` (multi-node hybrid +MPI) — that you can copy and adapt. The example below runs the bundled +single-tile pipeline end to end: + +```bash +# 1. 
Keep the SIF and Apptainer's scratch off the quota-limited $HOME. +# On candide a pull under $HOME fails with "disk quota exceeded"; +# point both at a roomy data partition instead. +export DATA=/n17data/$USER # adjust to your data partition +export APPTAINER_CACHEDIR=$DATA/.apptainer + +# 2. Pull the runtime image (~850 MB). +apptainer pull "$DATA/shapepipe-runtime.sif" \ + docker://ghcr.io/cosmostat/shapepipe:develop-runtime + +# 3. Submit the example pipeline. SPDIR is your local clone; it is +# bind-mounted at the same path inside the container so the config's +# $SPDIR-relative input/output directories resolve identically in and +# out of the container. +SP_IMAGE="$DATA/shapepipe-runtime.sif" SPDIR="/path/to/shapepipe" \ + sbatch example/pbs/candide_smp.sh +``` + +Both job scripts read `SP_IMAGE` (the SIF) and `SPDIR` (the clone) from +the environment, so the same script serves the example and a real run — +point the config inside the script at your own pipeline. The MPI script +additionally needs the host's OpenMPI to match the container's PMIx wire +protocol; it `module load`s a compatible OpenMPI (the image ships the +5.0.x series), and the script's header comments explain the contract. + +Adapting to another SLURM cluster is mostly the `#SBATCH` directives and +the `module load` line — the `apptainer exec` invocation carries over. + ## Three configuration layers Three files determine what the image contains. Each has a clear role; the From bb48a44f4f8593a513f2b358fc2a6ef608a6d135 Mon Sep 17 00:00:00 2001 From: Cail Daley Date: Sun, 31 May 2026 23:35:44 +0200 Subject: [PATCH 20/20] Move user-facing docs to the docs-rework PR (#739) The README front door, the container.md 'Running on a cluster' section, and the basic_execution.md MPI docs are relocated to #739, which owns the full docs story (cluster docs now live in a dedicated clusters.md, so keeping the walkthrough here too would duplicate it). 
This PR keeps only the code/infra and the CLAUDE.md build-loop note that the container changes here introduce. Co-Authored-By: Claude Opus 4.8 (1M context) --- README.rst | 45 +++++----------------------------- docs/source/basic_execution.md | 30 +++-------------------- docs/source/container.md | 38 ---------------------------- 3 files changed, 10 insertions(+), 103 deletions(-) diff --git a/README.rst b/README.rst index c92bdf0b6..da8fe0701 100644 --- a/README.rst +++ b/README.rst @@ -1,7 +1,7 @@ ShapePipe ========= -|CI| |CD| |python312| |release| +|CI| |CD| |python39| |release| .. |CI| image:: https://github.com/CosmoStat/shapepipe/workflows/CI/badge.svg :target: https://github.com/CosmoStat/shapepipe/actions?query=workflow%3ACI @@ -9,47 +9,14 @@ ShapePipe .. |CD| image:: https://github.com/CosmoStat/shapepipe/actions/workflows/pages/pages-build-deployment/badge.svg :target: https://github.com/CosmoStat/shapepipe/actions/workflows/pages/pages-build-deployment -.. |python312| image:: https://img.shields.io/badge/python-3.12-green.svg - :target: https://www.python.org/ +.. |python39| image:: https://img.shields.io/badge/python-3.9-green.svg + :target: https://www.python.org/‰ .. |release| image:: https://img.shields.io/github/v/release/CosmoStat/shapepipe :target: https://github.com/CosmoStat/shapepipe/releases/latest ShapePipe is a galaxy shape measurement pipeline developed within the -CosmoStat lab at CEA Paris-Saclay. It runs the full chain from raw survey -images to calibrated shear catalogues — object detection, PSF modelling, and -shape measurement — and produced the first UNIONS cosmic-shear release. +CosmoStat lab at CEA Paris-Saclay. -Quickstart ----------- - -ShapePipe ships as a container image, so you can run the bundled example -pipeline — a single CFIS tile through the full chain — without installing -anything: - -.. 
code-block:: bash - - # Apptainer (HPC, no root needed): - apptainer exec docker://ghcr.io/cosmostat/shapepipe:develop-runtime shapepipe_run_example - - # ...or Docker: - docker run --rm ghcr.io/cosmostat/shapepipe:develop-runtime shapepipe_run_example - -The image is published on every push to the `GitHub Container Registry -`_: -``:develop`` tracks the integration branch, release tags (e.g. ``:v1.1.0``) a -stable cut, and the ``-runtime`` suffix selects the slim batch image over the -full interactive one. - -Documentation -------------- - -Full documentation lives at https://cosmostat.github.io/shapepipe. Good places -to start: - -- `Installation `_ — getting ShapePipe onto your machine or cluster. -- `Basic execution `_ and `configuration `_ — running ``shapepipe_run`` and writing pipeline configs. -- `Container workflow `_ — the two image targets, the ``pyproject.toml`` / ``uv.lock`` / ``Dockerfile`` layers, and how to run on a SLURM cluster (with a worked candide example). - -If you use ShapePipe in academic work, please cite Guinot et al. (2022) and -Farrens et al. (2022). +See the `documentation `_ for details +on how to install and run ShapePipe. diff --git a/docs/source/basic_execution.md b/docs/source/basic_execution.md index 1f17aa598..9e7ca63b4 100644 --- a/docs/source/basic_execution.md +++ b/docs/source/basic_execution.md @@ -37,33 +37,11 @@ shapepipe_run -c ## Running the Pipeline with MPI ShapePipe can also use [mpi4py](https://mpi4py.readthedocs.io/en/stable/) -to spread work across multiple nodes of a cluster. Set `MODE = mpi` in the -`[EXECUTION]` section of the config and launch with an MPI runner: +for managing parallel processes on clusters with multiple nodes. +The `shapepipe_run` script can be run with MPI as follows ```bash -mpiexec -n shapepipe_run -c +mpiexec -n shapepipe_run ``` -where `` is the number of MPI processes to start. 
- -### Through the container (the supported way on a cluster) - -On a cluster you run ShapePipe from the published image as a standard Apptainer -*hybrid* MPI job: the **host** `mpirun`/`mpiexec` launches one container rank per -slot, and the OpenMPI bundled in the image wires the ranks together. - -```bash -# one-time: pull the runtime image -apptainer pull shapepipe.sif docker://ghcr.io/cosmostat/shapepipe:develop-runtime - -# load a host MPI in the same family as the image's OpenMPI (5.0.x), then launch -module load openmpi -mpirun -n \ - apptainer exec --bind "$PWD:$PWD" shapepipe.sif \ - shapepipe_run -c -``` - -The image ships **OpenMPI 5.0.x** so that its PMIx matches modern cluster -launchers. The host and container MPI must be compatible: if you see *N* copies -of `rank 0 of 1` instead of one *N*-rank job, load a host OpenMPI in the 5.0.x -family. See `example/pbs/candide_mpi.sh` for a complete SLURM batch script. +where `` is the number of cores to allocate to the run. diff --git a/docs/source/container.md b/docs/source/container.md index 405f09c40..2511a7f3e 100644 --- a/docs/source/container.md +++ b/docs/source/container.md @@ -91,44 +91,6 @@ in, all because something tries to write under `/app` or `$HOME`: If you bypass `/tmp` (e.g. with a custom apptainer profile) you may need to override these manually. -## Running on a cluster (SLURM) - -On a batch cluster you pull the slim `runtime` image once to a SIF, then -submit a job that runs `shapepipe_run` through it. The repo ships ready -SLURM scripts for the **candide** cluster in `example/pbs/` — -`candide_smp.sh` (single node) and `candide_mpi.sh` (multi-node hybrid -MPI) — that you can copy and adapt. The example below runs the bundled -single-tile pipeline end to end: - -```bash -# 1. Keep the SIF and Apptainer's scratch off the quota-limited $HOME. -# On candide a pull under $HOME fails with "disk quota exceeded"; -# point both at a roomy data partition instead. 
-export DATA=/n17data/$USER # adjust to your data partition -export APPTAINER_CACHEDIR=$DATA/.apptainer - -# 2. Pull the runtime image (~850 MB). -apptainer pull "$DATA/shapepipe-runtime.sif" \ - docker://ghcr.io/cosmostat/shapepipe:develop-runtime - -# 3. Submit the example pipeline. SPDIR is your local clone; it is -# bind-mounted at the same path inside the container so the config's -# $SPDIR-relative input/output directories resolve identically in and -# out of the container. -SP_IMAGE="$DATA/shapepipe-runtime.sif" SPDIR="/path/to/shapepipe" \ - sbatch example/pbs/candide_smp.sh -``` - -Both job scripts read `SP_IMAGE` (the SIF) and `SPDIR` (the clone) from -the environment, so the same script serves the example and a real run — -point the config inside the script at your own pipeline. The MPI script -additionally needs the host's OpenMPI to match the container's PMIx wire -protocol; it `module load`s a compatible OpenMPI (the image ships the -5.0.x series), and the script's header comments explain the contract. - -Adapting to another SLURM cluster is mostly the `#SBATCH` directives and -the `module load` line — the `apptainer exec` invocation carries over. - ## Three configuration layers Three files determine what the image contains. Each has a clear role; the