Modernize packaging, pipeline, and report generation workflows by alsmith151 · Pull Request #341 · sims-lab/CapCruncher

alsmith151 · 2026-04-28T14:55:08Z

This pull request introduces significant modernization and infrastructure improvements to the CapCruncher project. The main changes include adding a new Docker build system, updating CI workflows for Python 3.12 and newer GitHub Actions versions, improving documentation for both users and developers, and cleaning up legacy or unused code and dependencies. These changes aim to make the project easier to maintain, more robust across environments, and ready for current best practices in Python and workflow management.

Containerization and Deployment:

Added a production-ready Dockerfile based on micromamba, supporting both linux/amd64 and linux/arm64, and including key dependencies such as Apptainer and Quarto. This enables robust container-based workflows for both local and HPC environments.
Added a .dockerignore file to optimize Docker builds by excluding unnecessary files and directories.
Introduced a new GitHub Actions workflow (container-build.yml) for automated container builds, smoke testing, and publishing images to GitHub Container Registry.

Continuous Integration and Testing:

Updated the main CI workflow to use Python 3.12, upgraded all GitHub Actions to their latest major versions, and improved caching for dependencies and Bowtie2 indices. Also removed a direct dependency on a custom CoolBox fork in test setup.

Documentation and Developer Guidance:

Substantially revised README.md for clarity, modern usage, and quick-start instructions, including new sections on installation, CLI, and development.
Added an AGENTS.md file with detailed modernization, development, and workflow guidelines for contributors, including conventions, environment notes, and known caveats.

Codebase and Packaging Cleanup:

Removed legacy and unused API imports from capcruncher/api/__init__.py, reflecting a move away from exposing a monolithic API surface.
Ensured all Snakemake profile files are included in package distributions by updating MANIFEST.in.

These changes collectively modernize CapCruncher’s infrastructure, improve developer and user experience, and set up the project for reliable containerized and CI-driven workflows.

- Updated report_text.yml to change comments to proper headings and improved descriptions.
- Removed the copy_report_template rule from statistics.smk and replaced it with a script call to make_report.py.
- Enhanced the capcruncher_subprocess_env fixture in conftest.py to include the repository root in PYTHONPATH.
- Updated docker.md to reflect the removal of Quarto from the Docker image.
- Added plotly and pyyaml to environment.yml and requirements files.
- Updated pyproject.toml to include .py files for report generation and removed unnecessary ignores.
- Added tests for new report generation functionality and improved existing tests to ensure proper handling of viewpoint categories.
- Implemented functionality to prune unused viewpoint categories in interactions_deduplicate and slice filtering.

- bed.py: recover from SchemaError by dropping rows where end <= start instead of returning an empty DataFrame; use pd.Series wrappers to avoid bool.__invert__ (deprecated since Python 3.12, scheduled for removal in Python 3.16)
- genome.py: filter zero-length fragments before sorting digest output
- conftest.py: strip --compress-prog-args from flash2 invocation; forward exit code from subprocess
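The bed.py and genome.py fixes above both amount to discarding intervals that have no positive length rather than rejecting the whole table. A minimal sketch of that filter, assuming plain `(chrom, start, end)` tuples instead of the DataFrames the project works with; `drop_invalid_intervals` is a hypothetical name, not a CapCruncher function:

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end)


def drop_invalid_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Keep only intervals with end > start, then sort by chrom and start.

    Hypothetical helper: it mirrors the idea of dropping bad rows instead of
    returning nothing when validation fails.
    """
    valid = [iv for iv in intervals if iv[2] > iv[1]]
    return sorted(valid, key=lambda iv: (iv[0], iv[1]))


if __name__ == "__main__":
    fragments = [
        ("chr14", 200, 300),
        ("chr14", 100, 100),  # zero-length: dropped
        ("chr14", 50, 40),    # end < start: dropped
        ("chr1", 10, 20),
    ]
    print(drop_invalid_intervals(fragments))
```

Filtering before sorting keeps the sort from ever seeing the degenerate rows.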

- Run pre-commit install + pre-push hooks
- Fix broken symlink: viewpoints.bed -> mm9_capture_viewpoints_Slc25A37.bed
- Exclude cookiecutter config dir from check-yaml (% chars in template)
- Add strict=False to zip() calls in get_test_data.ipynb (B905)
- Auto-fix trailing whitespace, end-of-file, ruff, ruff-format, snakefmt

- conftest.py: remove flash/gzcat/gsplit/multiqc shims; keep only the capcruncher shim that points at the local checkout. Real tools are now taken directly from the pixi environment.
- fastq.smk: invoke flash2 directly; drop --compress-prog-args pigz, which flash2 does not support.
- fastq.py / common.smk: on macOS accept the pixi-provided GNU split (unprefixed) in addition to gsplit; use gzip -dc instead of platform-specific zcat/gzcat for .gz decompression.
- plot.py: pass theme=None to GenomicFigure to work around a plotnado 0.3.1 bug where Theme.apply raises AttributeError on Spacer tracks that have no aesthetics field.
- test_workflow_scripts.py: update digest golden row count 303591→303397 to reflect zero-length fragment filtering added in 1949bfc.
- docs/plotting.ipynb: add missing import pyranges1 as pr.
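The switch to `gzip -dc` described above can be sketched as follows. This is an illustration under stated assumptions, not the code in fastq.py: `gz_decompress_command` and `read_decompressed` are hypothetical names, and the `.gz` suffix check is added for the example only.

```python
import subprocess
from pathlib import Path
from typing import List


def gz_decompress_command(path: str) -> List[str]:
    """Build one command for every platform that streams decompressed data to stdout.

    Using `gzip -dc` avoids choosing between zcat and gzcat per platform.
    """
    if Path(path).suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {path!r}")
    return ["gzip", "-dc", path]


def read_decompressed(path: str) -> bytes:
    """Run the command and return the decompressed bytes (requires gzip on PATH)."""
    result = subprocess.run(
        gz_decompress_command(path), check=True, capture_output=True
    )
    return result.stdout
```

Passing the command as a list keeps file names with spaces intact without shell quoting.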

- cookiecutter.json / capcruncher_config.yml: add plotting_genes variable so the genes path is configurable at project creation rather than hardcoded to a placeholder that causes a plotnado error at plot time.
- plot.py: guard genes track with pathlib.Path.is_file() so a missing or placeholder path is silently skipped.
- test_pipeline.py: pass mm9_chr14_genes.bed as plotting_genes in both config fixtures; add data_path fixture parameter.
- pixi.toml: add --dist loadscope to pipeline/all test tasks so that pytest-xdist keeps all tests in the same module on the same worker; this fixes test_stats_exist / test_bigwigs_exist / test_hub_exists failures caused by module-scoped fixtures being recreated on a different worker.
- test_workflow_scripts.py: update viewpoint_bins golden value 169744→169634 to reflect the shift in fragment IDs after zero-length fragment removal (1949bfc).
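The `pathlib.Path.is_file()` guard mentioned for plot.py might look roughly like this. `usable_genes_path` is a hypothetical helper written for illustration; how plot.py actually wires the check into track construction is not shown in this PR text.

```python
from pathlib import Path
from typing import Optional, Union


def usable_genes_path(genes: Union[str, Path, None]) -> Optional[Path]:
    """Return the genes path only if it points at an existing regular file.

    Hypothetical helper: a missing value, a placeholder string, or a
    directory all yield None, so the caller can skip the genes track.
    """
    if not genes:
        return None
    candidate = Path(genes)
    return candidate if candidate.is_file() else None
```

A caller would add the genes track only when the return value is not `None`, which matches the "silently skipped" behaviour described above.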


Add MANIFEST.in to correctly prune non-runtime trees (docs, tests, .github, lock files, Dockerfile) from the source distribution. Tighten pyproject.toml: add inline comment explaining why the ray extra is excluded from all (heavy footprint, not needed for standard pipeline).


- install-methods.yml: new workflow testing Python wheel, conda fallback, Docker, and Apptainer installs on every PR. Uses uv throughout; all smoke tests check --version before --help.
- repo-health.yml: new weekly schedule testing published packages on PyPI, Bioconda (gracefully warns if lagging), and Docker registry.
- CD.yml: add verify-pypi job that installs the just-published wheel by exact version from PyPI and runs smoke tests.
- container-build.yml: minor smoke test update.

Verify MANIFEST.in pruning, modern license metadata, extras aggregate, pyranges1-only environment, critical dependency bounds across all four manifests, documentation priority ordering, Docker/Apptainer CI coverage, and install-methods CI contract. All tests are static (no network/build).

Swap Apptainer above Docker throughout — most users are on HPC where Apptainer is the native container runtime. Add a decision table at the top of installation.md so non-technical users can identify their route at a glance. Expand Apptainer section with offline .sif fallback workflow. Replace pip with uv in the conda fallback install command.


…he mount rm -rf /home/mambauser/.cache fails with 'Device or resource busy' because the pip cache is a live BuildKit mount at that path. The home directory is not copied to the runtime stage anyway, so the cleanup was a no-op.

…n docs

Part 1 (fix comparison separator):
- Use `_vs_` instead of `-` to join condition names in comparison filenames, resolving MissingInputException when conditions themselves contain hyphens
- Update COMPARISON_TRACK_PATTERN regex in make_ucsc_hub.py to match new separator
- Fix bigwig_summarised wildcard constraint (was `comparison=`, should be `group=`)
- Broaden visualise.smk wildcard constraints to allow hyphens and underscores

Part 2 (design matrix):
- Add Pandera DesignSchema (unique sample, no-dot condition check)
- Replace get_design_matrix() with infer_design_from_fastqs() using correct rsplit logic: condition=everything before last `_`, replicate=last token
- Fix FastqSamples.from_files() to delegate to infer_design_from_fastqs()
- Validate design on Snakefile startup; force COMPARE_SAMPLES=False when all conditions are UNKNOWN (no design provided and inference failed)
- Add `capcruncher pipeline design` subcommand to preview/save inferred design

Part 3 (genome profiles):
- Add `capcruncher genome add/list/show/remove` for per-genome YAML profiles stored in ~/.capcruncher/genomes/ (XDG_CONFIG_HOME aware)
- Resolve `genome: {profile: <name>}` in format_config_dict() before validation
- Add `genome_profile` field to cookiecutter config template
- Add `capcruncher pipeline config --list-profiles` shorthand

Docs:
- Update all references from deprecated `capcruncher pipeline` / `pipeline-init` / `pipeline-config` to `capcruncher pipeline run` / `capcruncher pipeline init` / `capcruncher pipeline config`
- Document design matrix convention, validation rules, and genome profiles
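The rsplit convention and the `_vs_` separator described above can be illustrated with a small sketch. It is not the body of infer_design_from_fastqs(): `infer_design` and `comparison_name` are hypothetical names, and the fallback to `UNKNOWN` for names without an underscore is an assumption made for this example.

```python
from typing import Dict, List


def infer_design(sample_names: List[str]) -> List[Dict[str, str]]:
    """Split each sample name on its last underscore.

    Everything before the last "_" is the condition; the final token is the
    replicate. Names without "_" get the condition "UNKNOWN" (assumption).
    """
    design = []
    for name in sample_names:
        if "_" in name:
            condition, replicate = name.rsplit("_", 1)
        else:
            condition, replicate = "UNKNOWN", name
        design.append(
            {"sample": name, "condition": condition, "replicate": replicate}
        )
    return design


def comparison_name(condition_a: str, condition_b: str) -> str:
    """Join two conditions with "_vs_" so hyphens inside names stay unambiguous."""
    return f"{condition_a}_vs_{condition_b}"


if __name__ == "__main__":
    for row in infer_design(["WT-treated_1", "WT-treated_2", "KO_rep_1"]):
        print(row)
    print(comparison_name("WT-treated", "KO"))
```

Splitting from the right means a condition such as `KO_rep` keeps its internal underscore, and joining with `_vs_` means `WT-treated_vs_KO` cannot be misread the way `WT-treated-KO` could.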


…ovements

- Add run summary scorecard (alignment %, capture efficiency %, cis %, viewpoints detected)
- Add capture efficiency, cis/trans ratio, viewpoint detection summary, and reads-per-viewpoint uniformity sections
- Add alignment filtering dropout (% retained) chart tab
- Add count_religation.py script and Snakemake rule to measure re-ligation artefacts and cis interaction distance distributions per viewpoint
- Fix slider label overlapping plots (hide currentvalue overlay, increase pad)
- Reduce left margin (150→80px), cap slice-length histogram at 99th percentile
- Simplify cis/trans chart (remove pattern_shape, use facet_col instead of facet_row)
- Add all-samples box-plot summary tab to pipeline run statistics
- Scale chart heights with sample/viewpoint count, capped at 800px
- Add loguru logging throughout report generation
- Update report_text.yml with descriptions for all new sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
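Two of the report changes above are simple numeric rules: chart heights that grow with the number of rows up to 800px, and a histogram limited to the 99th percentile. A rough sketch, assuming a linear height rule and a nearest-rank percentile; `chart_height`, `cap_at_percentile`, `base_px`, and `per_row_px` are hypothetical, and only the 800px cap and the 99th percentile come from the commit message:

```python
from typing import List, Sequence


def chart_height(
    n_rows: int, base_px: int = 200, per_row_px: int = 40, max_px: int = 800
) -> int:
    """Grow the chart with the number of rows, never exceeding max_px."""
    return min(base_px + per_row_px * max(n_rows, 0), max_px)


def cap_at_percentile(values: Sequence[float], percentile: int = 99) -> List[float]:
    """Drop values above the given percentile (nearest-rank method).

    This is one interpretation of "cap": values beyond the cutoff are
    removed before plotting rather than clipped to the cutoff.
    """
    if not values:
        return []
    ordered = sorted(values)
    # Integer ceiling division avoids floating-point rounding in the rank.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    cutoff = ordered[rank - 1]
    return [v for v in values if v <= cutoff]
```

With these defaults a three-row chart is 320px tall, while anything with 15 or more rows stays at the 800px ceiling.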


alsmith151 added 30 commits April 24, 2026 14:47

feat: modernize packaging and pipeline presets

9cf9b84

refactor: remove ibis query layer

09a97ee

refactor: migrate intervals to pyranges1

1c66d9a

Modernise PyRanges interval annotation

1f5642f

refactor: modernize viewpoint validation script

45d99be

chore: relax native environment pins

1623d85

refactor: make viewpoint counting script testable

0762fe8

refactor: update workflow presets for snakemake 9

5e1ab8c

feat: add ghcr container build workflow

359451e

fix: harden workflow resource and checkpoint handling

82297f7

fix: include workflow runtime assets

d86224a

refactor: modernize track hub generation

403bb0f

refactor: migrate plotting to plotnado

0a80c35

docs: refresh installation guidance

8ed857c

chore: prune unused dependencies

96a3381

refactor: align hub generation with tracknado extractors

213f5cf

docs: point plotting customization to plotnado

3c224ba

feat: modernize pipeline profiles and container docs

9a46b17

test: cover brittle pipeline modernization paths

23987f0

chore: add plans directory to .gitignore

c822b22

refactor: remove deduplication module and clean up imports in API files

dac9765

test: exercise workflow scripts on pipeline outputs

14c8781

chore: add multiqc and pigz to environment dependencies

6e0b42f

test: include cooler workflow script output

b516f03

build: use flash2 conda package

fa6b3b3

build: remove unused tqdm dependency

603f16e

feat: add parquet file handling and update tests for cis and trans stats

33e3e52

fix: update color scheme for report visualization

8de8ced

refactor: isolate ray behind optional executor

ccbc4d3

alsmith151 and others added 30 commits May 28, 2026 17:58

feat: add flash script and update pytest parallelism in pixi.toml

f4ca6b6

fix: improve logging for missing viewpoints and handle empty output i…

aed0406

…n pileup function fix: add check for empty sorted bedgraph file in bedgraph_to_bigwig rule test: update expected output values in capture_pipeline_golden_outputs test

build: sync environment.yml, pixi.toml, and lock files

fd87955

Pin matplotlib >=3.10.9 in environment.yml to match pixi.toml. Update pixi.toml and regenerate pixi.lock with current dependency resolutions.

build: improve Docker image build and runtime scope

94dcb07

Multi-stage build strips compilers and Rust from the runtime image. Update .dockerignore to exclude tests, docs, lock files, and CI config from the build context so only runtime-necessary files are copied.

docs: update installation recommendations for HPC clusters and Linux …

7be5081

…workstations

fix(ci): use eWaterCycle/setup-apptainer@v2 — apptainer/install-appta…

6fd9409

…iner does not exist

feat: update Dockerfile and environment.yml to remove apptainer and a…

146fd77

…dd snakemake plugins

refactor: simplify pipeline command handling and remove deprecated op…

c714087

…tions

feat: add pandera and pydantic as dependencies for data validation

d0ed6e7

feat: expand hpc dependencies for snakemake plugins

65eab19

refactor: update pipeline command syntax for consistency and clarity

bb8b95d

feat: enhance deduplication rule to support flexible input arguments …

387e2bb

…for fastq files

fix: improve path existence check to handle comma-separated lists of …

9e1ef89

…paths

fix: handle space-separated file lists in run_unix_split command

3c3e35d

fix: update summarise function to apply schema overrides for float co…

d81a245

…lumns in input data

fix: remove default logger and configure logging for snakemake execution

6c3f4b5

chore: move as many pip dependencies as possible to conda based

3bb8953

fix: remove printshellcmds from various profile configurations

78785f4

fix: update CSV reading to infer schema length and configure logging …

040bbe6

…for Snakemake

alsmith151 wants to merge 160 commits into main from develop
