Modernize packaging, pipeline, and report generation workflows#341

Open
alsmith151 wants to merge 160 commits into main from develop


@alsmith151

Collaborator

This pull request modernizes the CapCruncher project's packaging and infrastructure. The main changes are a new Docker build system, CI workflows updated for Python 3.12 and newer GitHub Actions versions, improved documentation for both users and developers, and the removal of legacy or unused code and dependencies. Together they aim to make the project easier to maintain, more robust across environments, and aligned with current best practices in Python and workflow management.

Containerization and Deployment:

  • Added a production-ready Dockerfile based on micromamba, supporting both linux/amd64 and linux/arm64, and including key dependencies such as Apptainer and Quarto. This enables robust container-based workflows for both local and HPC environments.
  • Added a .dockerignore file to optimize Docker builds by excluding unnecessary files and directories.
  • Introduced a new GitHub Actions workflow (container-build.yml) for automated container builds, smoke testing, and publishing images to GitHub Container Registry.

Continuous Integration and Testing:

  • Updated the main CI workflow to use Python 3.12, upgraded all GitHub Actions to their latest major versions, and improved caching for dependencies and Bowtie2 indices. Also removed a direct dependency on a custom CoolBox fork in test setup.

Documentation and Developer Guidance:

  • Substantially revised README.md for clarity, modern usage, and quick-start instructions, including new sections on installation, CLI, and development.
  • Added an AGENTS.md file with detailed modernization, development, and workflow guidelines for contributors, including conventions, environment notes, and known caveats.

Codebase and Packaging Cleanup:

  • Removed legacy and unused API imports from capcruncher/api/__init__.py, reflecting a move away from exposing a monolithic API surface.
  • Ensured all Snakemake profile files are included in package distributions by updating MANIFEST.in.

These changes collectively modernize CapCruncher’s infrastructure, improve developer and user experience, and set up the project for reliable containerized and CI-driven workflows.

- Updated report_text.yml to change comments to proper headings and improved descriptions.
- Removed the copy_report_template rule from statistics.smk and replaced it with a script call to make_report.py.
- Enhanced the capcruncher_subprocess_env fixture in conftest.py to include the repository root in PYTHONPATH.
- Updated docker.md to reflect the removal of Quarto from the Docker image.
- Added plotly and pyyaml to environment.yml and requirements files.
- Updated pyproject.toml to include .py files for report generation and removed unnecessary ignores.
- Added tests for new report generation functionality and improved existing tests to ensure proper handling of viewpoint categories.
- Implemented functionality to prune unused viewpoint categories in interactions_deduplicate and slice filtering.
alsmith151 and others added 30 commits May 28, 2026 17:58
- bed.py: recover from SchemaError by dropping rows where end <= start
  instead of returning empty DataFrame; use pd.Series wrappers to avoid
  bool.__invert__ deprecation (Python 3.16)
- genome.py: filter zero-length fragments before sorting digest output
- conftest.py: strip --compress-prog-args from flash2 invocation; forward
  exit code from subprocess
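
The bed.py recovery described above (drop rows where end <= start rather than return an empty result) can be illustrated without pandas. This is a minimal sketch, assuming intervals are plain `(chrom, start, end)` tuples; the name `drop_invalid_intervals` is hypothetical, and the actual change operates on DataFrames validated with Pandera.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end)


def drop_invalid_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Keep only intervals whose end is strictly greater than their start.

    Hypothetical helper: mirrors the idea of dropping offending rows
    instead of discarding the whole table when validation fails.
    """
    return [iv for iv in intervals if iv[2] > iv[1]]


rows = [
    ("chr14", 100, 200),  # valid
    ("chr14", 300, 300),  # zero-length, dropped
    ("chr14", 500, 450),  # end before start, dropped
]
valid = drop_invalid_intervals(rows)
```

On the deprecation mentioned in the same commit: applying `~` to a plain Python `bool` performs integer inversion (`~True` is `-2`) and has been deprecated since Python 3.12, whereas `~` on a boolean `pd.Series` is element-wise logical negation, which is presumably why the mask is wrapped.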
- Run pre-commit install + pre-push hooks
- Fix broken symlink: viewpoints.bed -> mm9_capture_viewpoints_Slc25A37.bed
- Exclude cookiecutter config dir from check-yaml (% chars in template)
- Add strict=False to zip() calls in get_test_data.ipynb (B905)
- Auto-fix trailing whitespace, end-of-file, ruff, ruff-format, snakefmt
- conftest.py: remove flash/gzcat/gsplit/multiqc shims; keep only the
  capcruncher shim that points at the local checkout. Real tools are now
  taken directly from the pixi environment.
- fastq.smk: invoke flash2 directly; drop --compress-prog-args pigz
  which flash2 does not support.
- fastq.py / common.smk: on macOS accept the pixi-provided GNU split
  (unprefixed) in addition to gsplit; use gzip -dc instead of
  platform-specific zcat/gzcat for .gz decompression.
- plot.py: pass theme=None to GenomicFigure to work around a
  plotnado 0.3.1 bug where Theme.apply raises AttributeError on
  Spacer tracks that have no aesthetics field.
- test_workflow_scripts.py: update digest golden row count 303591→303397
  to reflect zero-length fragment filtering added in 1949bfc.
- docs/plotting.ipynb: add missing import pyranges1 as pr.
- cookiecutter.json / capcruncher_config.yml: add plotting_genes variable
  so the genes path is configurable at project creation rather than
  hardcoded to a placeholder that causes a plotnado error at plot time.
- plot.py: guard genes track with pathlib.Path.is_file() so a missing or
  placeholder path is silently skipped.
- test_pipeline.py: pass mm9_chr14_genes.bed as plotting_genes in both
  config fixtures; add data_path fixture parameter.
- pixi.toml: add --dist loadscope to pipeline/all test tasks so that
  pytest-xdist keeps all tests in the same module on the same worker —
  fixes test_stats_exist / test_bigwigs_exist / test_hub_exists failures
  caused by module-scoped fixtures being recreated on a different worker.
- test_workflow_scripts.py: update viewpoint_bins golden value 169744→169634
  to reflect the shift in fragment IDs after zero-length fragment removal
  (1949bfc).
…n pileup function

fix: add check for empty sorted bedgraph file in bedgraph_to_bigwig rule
test: update expected output values in capture_pipeline_golden_outputs test
Add MANIFEST.in to correctly prune non-runtime trees (docs, tests,
.github, lock files, Dockerfile) from the source distribution.

Tighten pyproject.toml: add inline comment explaining why the ray extra
is excluded from all (heavy footprint, not needed for standard pipeline).
Pin matplotlib >=3.10.9 in environment.yml to match pixi.toml. Update
pixi.toml and regenerate pixi.lock with current dependency resolutions.
Multi-stage build strips compilers and Rust from the runtime image.
Update .dockerignore to exclude tests, docs, lock files, and CI config
from the build context so only runtime-necessary files are copied.
install-methods.yml: new workflow testing Python wheel, conda fallback,
Docker, and Apptainer installs on every PR. Uses uv throughout; all
smoke tests check --version before --help.

repo-health.yml: new weekly schedule testing published packages on
PyPI, Bioconda (gracefully warns if lagging), and Docker registry.

CD.yml: add verify-pypi job that installs the just-published wheel by
exact version from PyPI and runs smoke tests.

container-build.yml: minor smoke test update.
Verify MANIFEST.in pruning, modern license metadata, extras aggregate,
pyranges1-only environment, critical dependency bounds across all four
manifests, documentation priority ordering, Docker/Apptainer CI coverage,
and install-methods CI contract. All tests are static (no network/build).
Swap Apptainer above Docker throughout — most users are on HPC where
Apptainer is the native container runtime. Add a decision table at the
top of installation.md so non-technical users can identify their route
at a glance. Expand Apptainer section with offline .sif fallback
workflow. Replace pip with uv in the conda fallback install command.
…he mount

rm -rf /home/mambauser/.cache fails with 'Device or resource busy' because
the pip cache is a live BuildKit mount at that path. The home directory is
not copied to the runtime stage anyway, so the cleanup was a no-op.
…n docs

Part 1 — fix comparison separator:
- Use `_vs_` instead of `-` to join condition names in comparison filenames,
  resolving MissingInputException when conditions themselves contain hyphens
- Update COMPARISON_TRACK_PATTERN regex in make_ucsc_hub.py to match new separator
- Fix bigwig_summarised wildcard constraint (was `comparison=`, should be `group=`)
- Broaden visualise.smk wildcard constraints to allow hyphens and underscores
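
A small sketch of why the separator change in Part 1 matters, assuming condition names such as `wt-1` that themselves contain hyphens. The helper names `comparison_name` and `split_comparison` are hypothetical and not taken from the CapCruncher code.

```python
from typing import Tuple

SEPARATOR = "_vs_"


def comparison_name(a: str, b: str, sep: str = SEPARATOR) -> str:
    """Join two condition names into a comparison label."""
    return f"{a}{sep}{b}"


def split_comparison(name: str, sep: str = SEPARATOR) -> Tuple[str, str]:
    """Recover the two condition names from a comparison label."""
    first, found, second = name.partition(sep)
    if not found:
        raise ValueError(f"{name!r} does not contain {sep!r}")
    return first, second


# With "-" as the separator the label cannot be split back unambiguously:
ambiguous_parts = comparison_name("wt-1", "ko", sep="-").split("-")

# With "_vs_" the original condition names round-trip intact:
recovered = split_comparison(comparison_name("wt-1", "ko"))
```

A condition name that itself contains `_vs_` would still be ambiguous under this scheme; the sketch does not guard against that case.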

Part 2 — design matrix:
- Add Pandera DesignSchema (unique sample, no-dot condition check)
- Replace get_design_matrix() with infer_design_from_fastqs() using correct
  rsplit logic: condition=everything before last `_`, replicate=last token
- Fix FastqSamples.from_files() to delegate to infer_design_from_fastqs()
- Validate design on Snakefile startup; force COMPARE_SAMPLES=False when all
  conditions are UNKNOWN (no design provided and inference failed)
- Add `capcruncher pipeline design` subcommand to preview/save inferred design
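
The rsplit convention in Part 2 can be sketched as follows. This is an illustration only: `split_sample_name` and `can_compare` are hypothetical names, and returning `None` as the replicate when inference fails is an assumption rather than the behaviour of `infer_design_from_fastqs()`.

```python
from typing import Iterable, Optional, Tuple


def split_sample_name(sample: str) -> Tuple[str, Optional[str]]:
    """Split a sample name into (condition, replicate).

    The condition is everything before the last underscore and the
    replicate is the last token. Names that cannot be split fall back
    to the UNKNOWN condition (replicate None is an assumption here).
    """
    parts = sample.rsplit("_", 1)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        return "UNKNOWN", None
    return parts[0], parts[1]


def can_compare(samples: Iterable[str]) -> bool:
    """Comparisons only make sense if at least one condition was inferred."""
    return any(split_sample_name(s)[0] != "UNKNOWN" for s in samples)


design = [split_sample_name(s) for s in ["wt_treated_1", "wt_treated_2", "ko_1"]]
```

Splitting only on the last underscore is what lets a condition such as `wt_treated` keep its internal underscore.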

Part 3 — genome profiles:
- Add `capcruncher genome add/list/show/remove` for per-genome YAML profiles
  stored in ~/.capcruncher/genomes/ (XDG_CONFIG_HOME aware)
- Resolve `genome: {profile: <name>}` in format_config_dict() before validation
- Add `genome_profile` field to cookiecutter config template
- Add `capcruncher pipeline config --list-profiles` shorthand
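
One way the profile directory lookup in Part 3 might work is sketched below. The commit only states that profiles live in `~/.capcruncher/genomes/` and that the location is XDG_CONFIG_HOME aware; the exact layout under `XDG_CONFIG_HOME` used here (`capcruncher/genomes`) and the function name `genome_profile_dir` are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def genome_profile_dir(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Return the directory holding per-genome YAML profiles.

    Assumed behaviour: prefer $XDG_CONFIG_HOME/capcruncher/genomes when
    the variable is set and non-empty, otherwise ~/.capcruncher/genomes.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    xdg = env.get("XDG_CONFIG_HOME")
    if xdg:
        return Path(xdg) / "capcruncher" / "genomes"
    return home / ".capcruncher" / "genomes"


default_dir = genome_profile_dir(env={}, home=Path("/home/user"))
xdg_dir = genome_profile_dir(env={"XDG_CONFIG_HOME": "/cfg"}, home=Path("/home/user"))
```

Passing `env` and `home` explicitly keeps the lookup easy to test without touching the real environment.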

Docs — update all references from deprecated `capcruncher pipeline` /
`pipeline-init` / `pipeline-config` to `capcruncher pipeline run` /
`capcruncher pipeline init` / `capcruncher pipeline config`; document
design matrix convention, validation rules, and genome profiles
…ovements

- Add run summary scorecard (alignment %, capture efficiency %, cis %, viewpoints detected)
- Add capture efficiency, cis/trans ratio, viewpoint detection summary, and reads-per-viewpoint uniformity sections
- Add alignment filtering dropout (% retained) chart tab
- Add count_religation.py script and Snakemake rule to measure re-ligation artefacts and cis interaction distance distributions per viewpoint
- Fix slider label overlapping plots (hide currentvalue overlay, increase pad)
- Reduce left margin (150→80px), cap slice-length histogram at 99th percentile
- Simplify cis/trans chart (remove pattern_shape, use facet_col instead of facet_row)
- Add all-samples box-plot summary tab to pipeline run statistics
- Scale chart heights with sample/viewpoint count, capped at 800px
- Add loguru logging throughout report generation
- Update report_text.yml with descriptions for all new sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>