
Nextflow 26.04.3 and strict parser migration, other refactoring #86

Merged
dltamayo merged 18 commits into main from dltamayo-dev
Jun 17, 2026

@dltamayo dltamayo commented Jun 17, 2026

Collaborator

This pull request introduces several significant updates and refactorings to the TCRtoolkit Nextflow pipeline, focusing on simplifying configuration, improving compatibility with Nextflow's strict syntax, standardizing column naming, and enhancing sample and gene family aggregation processes. The changes also include code clean-up by removing obsolete processes and improving resource management.

Key changes:

Configuration and Resource Management

  • Updated conf/base.config to remove the check_max resource limiting function and instead use direct assignments and the resourceLimits directive for process resources, simplifying configuration and aligning with modern Nextflow best practices. The pipeline name in the config header was also corrected. [1] [2] [3] [4]
  • Updated nextflow.config to use the latest nf-schema plugin version and removed the explicit inclusion of base.config, relying on standard Nextflow config loading.

Parameter Handling and Documentation

  • Updated the README.md to explain the new requirement to supply non-default parameters via params.yml due to Nextflow strict syntax, replacing command-line parameter examples accordingly.

Data Standardization and Processing

  • Standardized column names across the pipeline from CDR3b to junction_aa (and similar for other columns) in OLGA, TCRsharing, and related modules, ensuring consistency and compatibility. [1] [2] [3] [4]
  • Updated the ANNOTATE_PROCESS to output additional pre-filter statistics, include more columns in output, and filter for productive sequences, improving downstream data quality and reporting.
  • Modified GIANA and GLIPH2 clustering steps to standardize input column names and ensure compatibility with external tools. [1] [2] [3]
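
The pre-filter statistics and productive-only filtering described above can be pictured with a small sketch. This is not the pipeline's code: the function name `split_productive`, the stats keys, and the toy table are hypothetical, and it assumes AIRR-style `junction_aa`, `v_call`, and `productive` columns with `T`/`F` or `true`/`false` values.

```python
import csv
import io


def split_productive(rows):
    """Count clones before filtering, then keep productive clones with a called V gene."""

    def is_productive(row):
        # Assumption: productive is serialized as T/F or true/false.
        return str(row.get("productive", "")).strip().lower() in {"t", "true"}

    stats = {"total": 0, "productive": 0, "nonproductive": 0}
    kept = []
    for row in rows:
        # Statistics are taken on the unfiltered repertoire.
        stats["total"] += 1
        if is_productive(row):
            stats["productive"] += 1
        else:
            stats["nonproductive"] += 1
        # Filtering keeps only productive clones whose V gene was called.
        if is_productive(row) and row.get("v_call", "").strip():
            kept.append(row)
    return stats, kept


# Hypothetical three-clone repertoire: one clean, one nonproductive, one without a V call.
AIRR_TSV = (
    "junction_aa\tv_call\tproductive\n"
    "CASSLGQETQYF\tTRBV7-9\tT\n"
    "CASS*GETQYF\tTRBV5-1\tF\n"
    "CASSPGTGELFF\t\tT\n"
)

rows = list(csv.DictReader(io.StringIO(AIRR_TSV), delimiter="\t"))
stats, kept = split_productive(rows)
print(stats)
print([r["junction_aa"] for r in kept])
```

Counting before filtering is what lets a downstream sample-level step report nonproductive clones even though it only ever sees the filtered repertoire.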

Sample and Gene Family Aggregation

  • Added a new SAMPLE_CALC_PIVOT process to pivot gene family counts into wide format CSVs, with custom gene sorting for clarity and downstream analysis.
  • Enhanced SAMPLE_CALC to accept and utilize pre-filter statistics as input, supporting improved sample-level calculations. [1] [2]
  • Removed obsolete SAMPLE_AGGREGATE and SAMPLESHEET_RESOLVE processes, streamlining the codebase. [1] [2]
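
As a rough illustration of the long-to-wide pivot with custom gene sorting, the sketch below derives its columns from whichever genes are observed instead of from a fixed maximum index. The names `gene_sort_key` and `pivot_wide`, the tuple layout, and the gene labels are all hypothetical; the actual SAMPLE_CALC_PIVOT implementation may differ.

```python
import re
from collections import defaultdict


def gene_sort_key(gene):
    """Order genes by their numeric parts so TRBV2 sorts before TRBV10."""
    prefix = re.match(r"[A-Za-z]+", gene)
    numbers = [int(n) for n in re.findall(r"\d+", gene)]
    return (prefix.group(0) if prefix else "", numbers)


def pivot_wide(long_rows):
    """Turn (sample, gene, count) rows into a header plus one row per sample."""
    counts = defaultdict(dict)
    for sample, gene, count in long_rows:
        counts[sample][gene] = counts[sample].get(gene, 0) + count
    # Columns come from the genes actually seen, not from a hard-coded range.
    genes = sorted(
        {g for per_sample in counts.values() for g in per_sample},
        key=gene_sort_key,
    )
    header = ["sample"] + genes
    table = [
        [sample] + [counts[sample].get(g, 0) for g in genes]
        for sample in sorted(counts)
    ]
    return header, table


# Illustrative long-form rows; a gene missing from a sample becomes 0 in wide form.
long_rows = [
    ("S1", "TRBV10", 4),
    ("S1", "TRBV2", 7),
    ("S2", "TRBV31", 1),
    ("S2", "TRBV2", 3),
]
header, table = pivot_wide(long_rows)
print(header)
for row in table:
    print(row)
```

A plain alphabetical sort would place TRBV10 before TRBV2, which is why the sort key compares the numeric parts as integers.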

Miscellaneous Improvements

  • Updated the default subject column for V gene plots in .cirro/process-form.json for clarity.
  • Improved R and Python scripts in various modules for better data handling, consistency, and output formatting (e.g., in TCRPHENO, SAMPLESHEET_PHENO). [1] [2] [3]
  • Removed redundant container directives from certain process definitions, relying on global configuration. [1] [2]
  • Removed an unnecessary DSL version specification from main.nf.

These changes collectively modernize the pipeline, improve reproducibility, and make the codebase easier to maintain and extend.

dltamayo and others added 15 commits May 26, 2026 12:38
- replace sample_agg with collectFile()
- Use AIRR format throughout pipeline
- Only rename columns for tools that require specific column names for input (giana, gliph)
- Filtering occurs at ANNOTATE:ANNOTATE_PROCESS, upstream of the Sample/Compare workflows. Only clones with productive CDR3 sequences and called V genes are kept; nonproductive clones and clones with uncalled V genes are filtered out. All downstream tools/analyses work from the filtered repertoires.
- Total productive/nonproductive clone counts are calculated prior to filtering and passed to sample_calc.py
- Bug fix for broken tcrpheno output files, plus removal of the filtering step in the R script
Implementing bugfixes from commit 389a328:
  1. bin/sample_calc.py — Gene family CSVs (v_family, d_family, j_family) were
     built with hard-coded maximum indices (TRBV: 30, TRBD: 2, TRBJ: 2), silently
     dropping any gene with a number above those limits. The max index is now
     derived dynamically from genes observed in each sample.
- SAMPLE_CALC now outputs genes and counts in long form, which is then collected and pivoted wide in SAMPLE_CALC_PIVOT.

  3. bin/pseudobulk.py — Cell Ranger AIRR output was not filtered for cell or
     contig quality before pseudobulking. is_cell, high_confidence, and productive
     filters are now applied in both pseudobulk() and pseudobulk_phenotype() when
     those columns are present, ensuring background barcodes, low-confidence
     assemblies, and non-productive contigs are excluded from single-cell input.

Co-Authored-By: KevinMLanderos <kevinmezalanderos@gmail.com>
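
A minimal sketch of the "filter only when the column is present" behaviour described for bin/pseudobulk.py follows. It is not the script itself: `filter_contigs`, `truthy`, the dict-per-contig layout, and the barcode values are hypothetical, and boolean flags are assumed to arrive as text.

```python
def truthy(value):
    """Interpret common text encodings of a boolean flag."""
    return str(value).strip().lower() in {"true", "t", "1"}


# Quality flags named in the commit message above.
QUALITY_COLUMNS = ("is_cell", "high_confidence", "productive")


def filter_contigs(rows):
    """Keep contigs that pass every quality flag that exists in the input."""
    if not rows:
        return []
    # Only columns that are actually present take part in the filter.
    present = [c for c in QUALITY_COLUMNS if c in rows[0]]
    return [r for r in rows if all(truthy(r[c]) for c in present)]


# Hypothetical contigs: a real cell, a background barcode, a nonproductive contig.
contigs = [
    {"cell_id": "AAAC-1", "is_cell": "true", "high_confidence": "true", "productive": "true"},
    {"cell_id": "AAAG-1", "is_cell": "false", "high_confidence": "true", "productive": "true"},
    {"cell_id": "AACT-1", "is_cell": "true", "high_confidence": "true", "productive": "false"},
]
kept_contigs = filter_contigs(contigs)
# Input without any quality columns passes through unchanged.
no_flags = filter_contigs([{"cell_id": "AAAC-1"}])
print([r["cell_id"] for r in kept_contigs])
```

Making each filter conditional on its column keeps the same function usable for inputs that do not carry these flags.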
- With the refactoring of compare_calc.py, no tasks read file paths from the input samplesheet directly any more. Those direct reads were what required an intermediate localization/resolving step for S3 on Batch. All inputs now use sample_map.
Co-Authored-By: dimalvovs <dmitrijs.lvovs@gmail.com>
- Mostly replacing .set {} with def =
Using params.yml to supply parameters for strict syntax, instead of CLI
- With refactoring of gliph2 to patient subworkflow, compute resource needs should be alleviated for smaller datasets
- Making name of metadata field indicating subject-level grouping of samples consistent with patient workflow

github-actions Bot commented Jun 17, 2026


Unit Test Results

10 tests   10 ✅  3m 14s ⏱️
 2 suites   0 💤
 1 files     0 ❌

Results for commit 92486a3.

♻️ This comment has been updated with latest results.

Copilot AI left a comment

Contributor


Pull request overview

This PR migrates the TCRtoolkit Nextflow pipeline to Nextflow 26.04.3 / strict-parser expectations while refactoring several subworkflows and updating metadata conventions (notably subject_id → patient). It also updates the nf-schema plugin/schema and streamlines sample-level aggregation logic.

Changes:

  • Refactors workflow wiring to pass annotated outputs downstream (including new per-sample pre-filter stats).
  • Renames samplesheet metadata field from subject_id to patient across tests/fixtures/config/schema.
  • Updates config/schema for strict parsing and adds a params.yml example for running the pipeline.

Reviewed changes

Copilot reviewed 37 out of 41 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| `workflows/tcrtoolkit.nf` | Updates workflow wiring (ANNOTATE → SAMPLE/PATIENT/COMPARE) and removes old samplesheet resolution path. |
| `subworkflows/local/input_check.nf` | Refactors INPUT_CHECK outputs (currently introduces invalid emit: placement). |
| `subworkflows/local/annotate.nf` | Emits new per_sample_stats and refactors intermediate channels. |
| `subworkflows/local/sample.nf` | Joins annotated samples with per-sample stats, replaces custom aggregation with collectFile + new pivot process. |
| `modules/local/annotate/main.nf` | Adds pre_filter_stats sidecar output and filters clones to productive. |
| `modules/local/sample/sample_calc.nf` | Extends SAMPLE_CALC inputs and introduces SAMPLE_CALC_PIVOT for long→wide gene-family tables. |
| `bin/sample_calc.py` | Switches gene-family output to long format; consumes pre-filter stats sidecar. |
| `subworkflows/local/pseudobulk_phenotype.nf` | Refactors aggregation for phenotype pseudobulk workflow (currently incompatible with new SAMPLE_CALC signature). |
| `subworkflows/local/convert/*` | Refactors CONVERT wiring and subworkflow emits (currently mismatched output naming). |
| `modules/local/compare/tcrsharing.nf` | Updates sharing calc (currently uses wrong key column name for current concat format). |
| `modules/local/olga/main.nf` | Renames pgen key header to junction_aa and updates index loading accordingly. |
| `nextflow.config` | Updates nf-schema plugin version, strict-typed resource params, config include order, and defaults (incl. use_gliph2). |
| `nextflow_schema.json` | Updates schema types/defaults and adds OLGA options + use_gliph2 default. |
| `conf/base.config` | Simplifies resource handling and switches to resourceLimits. |
| `README.md` / `params.yml` | Documents strict-syntax usage via -params-file and adds a minimal params file. |
| `tests/*` / `fixtures/*` / `.cirro/process-form.json` | Updates samplesheet headers and tests for patient field. |
| `subworkflows/local/validate_params.nf` | Disables paramsSummaryLog with a version-reference comment. |


Comment thread workflows/tcrtoolkit.nf
Comment thread subworkflows/local/input_check.nf
Comment thread subworkflows/local/convert/main.nf
Comment thread subworkflows/local/convert/adaptive.nf
Comment thread subworkflows/local/convert/pseudobulk_cellranger.nf
Comment thread modules/local/compare/tcrsharing.nf
Comment thread subworkflows/local/pseudobulk_phenotype.nf
Comment thread modules/local/sample/sample_calc.nf
Comment thread nextflow.config
Comment thread subworkflows/local/validate_params.nf Outdated
dltamayo added 2 commits June 17, 2026 13:21
publishDir and container are more appropriate for base.config (globally for all processes), rather than modules.config (overrides of defaults at a module level)
- update NF version naming in comment
- revert new File() back to file() to allow for URLs to be submitted
@dltamayo dltamayo merged commit e931159 into main Jun 17, 2026
3 checks passed
2 participants