Docs: machine-specific cluster tree + freshness pass#739

Open: cailmdaley wants to merge 3 commits into develop from docs/rework

@cailmdaley cailmdaley commented May 31, 2026

Update: this PR now also carries the README front door and the basic_execution.md MPI run docs, relocated from #737 so that all user-facing docs live here. The candide cluster walkthrough lives in clusters.md (not duplicated in container.md).


Audited every narrative docs page against the current code. The install / container / testing / API pages were already fresh (the conda→uv/container work kept them current); staleness concentrated in cluster docs and a handful of content errors. This PR fixes both.

Machine-specific cluster tree

Cluster guidance was scattered and half-invisible: candide lived only inside container.md (and only on the #737 branch), canfar was split across orphaned pages, and none of the canfar/candide pages were in the sidebar at all.

New single clusters.md under a "Running on a cluster" toctree caption:

  • The pattern: the shared truths (the container is the unit of execution, bind-mount your clone at the same path, keep SIFs/cache off a quota-limited $HOME).
  • candide (SLURM): sbatch, the candide_{smp,mpi}.sh scripts, the quota-safe pull → submit, partitions, the MPI/PMIx note.
  • CANFAR: the current model (canfar_submit_job / canfar_monitor console scripts), with the deep production walkthrough kept in pipeline_canfar.md (linked, and now in the toctree).
  • ccin2p3: honest stub (not yet containerized).
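
The shared pattern above can be sketched in a few lines of Python. This is only an illustration, not the contents of the candide scripts: the scratch paths, the SIF file name, and the `shapepipe_run -c config.ini` command are placeholders assumed for the example. The two Apptainer features it relies on are the `APPTAINER_CACHEDIR` environment variable and the `--bind src:dest` option of `apptainer exec`.

```python
import os
import shlex
from pathlib import Path


def cluster_env(scratch: Path) -> dict[str, str]:
    """Copy the current environment and point Apptainer's cache at scratch,
    so that pulled layers do not fill a quota-limited $HOME."""
    env = dict(os.environ)
    env["APPTAINER_CACHEDIR"] = str(scratch / "apptainer_cache")
    return env


def exec_argv(sif: Path, clone: Path, command: list[str]) -> list[str]:
    """Build an `apptainer exec` call that bind-mounts the clone at the
    same path inside the container as on the host."""
    bind = f"{clone}:{clone}"
    return ["apptainer", "exec", "--bind", bind, str(sif), *command]


if __name__ == "__main__":
    # All paths and the command below are hypothetical placeholders.
    scratch = Path("/scratch/someuser")
    argv = exec_argv(
        sif=scratch / "shapepipe.sif",
        clone=scratch / "shapepipe",
        command=["shapepipe_run", "-c", "config.ini"],
    )
    print("APPTAINER_CACHEDIR=" + cluster_env(scratch)["APPTAINER_CACHEDIR"])
    print(shlex.join(argv))
```

Building the argument list separately from running it keeps the sketch usable in a dry run: the printed line can be pasted into a batch script, or the list can be passed to `subprocess.run` together with the environment from `cluster_env`.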

Deleted obsolete pages

canfar.md (old curl-VM submission, superseded by canfar_submit_job), pipeline_v2.0.md (personal paths, a missing script), work_flow_v2.0.md (an unrealized planning wishlist) — all three orphaned. The v2.0 wishlist was preserved in the team's felt store before deletion.

Content fixes

  • dependencies.md: rewritten against pyproject.toml. Reframed around the abstract-minimums + uv.lock SSOT (was "pinned per release"); ngmix now points at the aguinot/ngmix@stable_version fork (was esheldon upstream); dropped the phantom CDSclient; added the missing CANFAR/data stack (vos, skaha, canfar, cs_util, astroquery, reproject, h5py, numba).
  • post_processing.md: dropped the removed rho-statistics step and the dead prepare_tiles_for_final command; added a legacy banner pointing at sp_validation.
  • random_cat.md: added a legacy banner; fixed the module name random_runner → random_cat_runner.
  • pipeline_canfar.md: flagged the matched-star / coverage-mask helpers that moved to sp_validation.
  • basic_execution.md: replaced the conda-era "activate the environment" framing with the container reality. MPI sections deferred pending the keep/drop decision on #737 ("Fix MPI on candide (OpenMPI 5 image + latent code bug); containerize & SLURM-ify candide scripts").
  • Cosmetics: configuration.md (conifg → config, NUMBERING_LIST → NUMBER_LIST), contributing.md (Pleas → Please), module_develop.md (src/shapepipe/modules).

Verification

Local sphinx-book-theme build succeeds. The one new warning the tree introduced (a clusters.md heading anchor) is fixed; remaining warnings are all pre-existing (the autosummary API page needs the installed package; the multiple-toctree notice fires on every page).
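
One way to separate new warnings from pre-existing ones is to diff the warnings of two build logs. The sketch below is a hypothetical helper, not part of this PR: it assumes plain-text logs captured from a build on the base branch and a build on this branch, and it relies only on Sphinx's usual `<file>:<line>: WARNING: <message>` line format. Line numbers are dropped so that an old warning that merely moved does not count as new.

```python
import re

# "<file>:<line>: WARNING: <message>"; the file and line parts are optional
# because some Sphinx warnings are not tied to a source location.
_WARNING = re.compile(r"^(?:(?P<file>.*?)(?::\d+)?: )?WARNING: (?P<msg>.*)$")


def warning_keys(log: str) -> set[tuple[str, str]]:
    """Return the (file, message) pairs of every warning line in a build log."""
    keys = set()
    for line in log.splitlines():
        match = _WARNING.match(line.strip())
        if match:
            keys.add((match.group("file") or "", match.group("msg")))
    return keys


def new_warnings(baseline_log: str, candidate_log: str) -> list[tuple[str, str]]:
    """Warnings present in the candidate build but absent from the baseline."""
    return sorted(warning_keys(candidate_log) - warning_keys(baseline_log))
```

An empty result from `new_warnings` would correspond to the state described above: every remaining warning already existed before the change.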

Relationship to the other docs PRs

— Claude on behalf of Cail

@cailmdaley cailmdaley requested review from martinkilbinger and sfarrens and removed request for sfarrens May 31, 2026 20:14
cailmdaley added a commit that referenced this pull request May 31, 2026
Three fibers from this session's docs work:
- docs-versioning: the versioned-site + switcher design (#738) and the
  recurring unexercised-path bit-rot pattern.
- docs-cluster-tree: the machine-specific clusters.md decision (#739) and why
  a single page beat a thin standalone general page.
- v2-run-plan: the v2.0 run wishlist rescued from the deleted
  work_flow_v2.0.md docs page before removal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request May 31, 2026
The README front door, the container.md 'Running on a cluster' section, and the
basic_execution.md MPI docs are relocated to #739, which owns the full docs
story (cluster docs now live in a dedicated clusters.md, so keeping the
walkthrough here too would duplicate it). This PR keeps only the code/infra and
the CLAUDE.md build-loop note that the container changes here introduce.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cailmdaley and others added 3 commits May 31, 2026 23:41
Audited every narrative docs page against the current code. The install /
container / testing / API pages were already fresh; the staleness concentrated
in cluster docs and a few content errors. This rework:

**Machine-specific cluster tree.** Cluster guidance was scattered and half
of it invisible (candide lived only inside container.md on a feature branch;
canfar was split across orphaned pages; none of canfar/candide were in the
sidebar). Add a single `clusters.md` under a new "Running on a cluster" toctree
caption: the shared pattern (container = unit of execution, bind-mount, keep
SIFs off a quota-limited $HOME), then per-machine sections for candide (SLURM,
the candide_{smp,mpi}.sh scripts, the quota-safe pull, MPI/PMIx) and CANFAR
(the current canfar_submit_job / canfar_monitor console scripts), with ccin2p3
stubbed. The deep CANFAR production walkthrough stays in pipeline_canfar.md,
linked, and is now in the toctree too.

**Delete obsolete pages.** canfar.md (the old curl-VM submission model,
superseded by canfar_submit_job), pipeline_v2.0.md (personal paths, a missing
script), and work_flow_v2.0.md (an unrealized planning wishlist) — all three
orphaned from the toctree. The v2.0 wishlist is preserved in the team's felt
store rather than lost.

**Fix content errors.**
- dependencies.md: rewritten against pyproject.toml. Reframed around the
  abstract-minimums + uv.lock SSOT (was "pinned per release"); ngmix now points
  at the aguinot/ngmix@stable_version fork (was esheldon upstream); dropped the
  phantom CDSclient; added the missing CANFAR/data stack (vos, skaha, canfar,
  cs_util, astroquery, reproject, h5py, numba).
- post_processing.md: dropped the removed rho-statistics step and the dead
  prepare_tiles_for_final command; added a legacy banner pointing at sp_validation.
- random_cat.md: legacy banner; fixed module name random_runner -> random_cat_runner.
- pipeline_canfar.md: flagged the matched-star / coverage-mask helpers that
  moved to sp_validation (merge_psf_cat.py, download_headers, …).
- basic_execution.md: replaced the conda-era "activate the environment" framing
  with the container reality. (MPI sections deferred pending the #737 decision.)
- configuration.md (conifg->config, NUMBERING_LIST->NUMBER_LIST),
  contributing.md (Pleas->Please), module_develop.md (src/shapepipe/modules).

Verified with a local sphinx-book-theme build: succeeds; the only new warning
the tree introduced (a clusters.md heading anchor) is fixed. Remaining warnings
are all pre-existing (the autosummary API page needs the installed package;
multiple-toctree notices on every page).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…itHub

The explicit MyST target showed as raw '(candide-slurm)=' in GitHub's blob
view (where PR links point readers). Use a plain-text in-page reference; the
candide section is still reachable via the sidebar and GitHub's own heading
anchor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
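
A small check can list the explicit MyST targets left in a page, since each `(label)=` line shows up as raw text in GitHub's blob view. The function below is a hypothetical helper written for illustration and is not part of this PR; it assumes a target sits on a line of its own and it skips fenced code blocks.

```python
import re

# An explicit MyST target is a line of the form "(label)=".
_TARGET = re.compile(r"^\(([A-Za-z0-9_.:-]+)\)=\s*$")


def explicit_targets(markdown: str) -> list[tuple[int, str]]:
    """Return (line number, label) for each explicit MyST target outside code fences."""
    found = []
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on opening and closing fences
            continue
        if in_fence:
            continue
        match = _TARGET.match(line)
        if match:
            found.append((lineno, match.group(1)))
    return found
```

Run over clusters.md, a result containing `candide-slurm` would indicate that the raw target described in the commit above is still present.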
Unify all user-facing docs in this PR (relocated from #737, which is now pure
code/infra):
- README front door (Quickstart + Documentation signpost). The signpost now
  has a dedicated 'Running on a cluster' entry pointing at clusters.html, and
  the container-workflow entry no longer claims to carry the cluster example
  (that lives in clusters.md).
- basic_execution.md MPI section: the hybrid-Apptainer run pattern and the
  OpenMPI-5 PMIx note, kept alongside the conda-framing fix.
- container.md gains a one-line pointer to clusters.md.

This removes the container.md/clusters.md duplication at the source rather than
reconciling it after merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>