Nextflow 26.04.3 and strict parser migration, other refactoring #86
Merged
- replace sample_agg with collectFile()
- Use AIRR format throughout pipeline
- Only rename columns for tools that require specific column names for input (giana, gliph)
- Filtering occurs at ANNOTATE:ANNOTATE_PROCESS, upstream of the Sample/Compare workflows. Only clones with productive CDR3 sequences and called V genes are kept; nonproductive clones and clones with uncalled V genes are filtered out. All downstream tools/analyses work off the filtered repertoires.
- Total productive/nonproductive clone counts are calculated prior to filtering and passed to sample_calc.py
- Bug fix for broken tcrpheno output files, as well as removal of the filtering step in the Rscript
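The count-then-filter order described above can be sketched in plain Python. This is only an illustration, not the code in ANNOTATE_PROCESS: the helper names `filter_clones` and `_is_true` are hypothetical, and it assumes one AIRR row per clone with the standard `productive` and `v_call` columns.

```python
import csv
import io


def _is_true(value):
    """AIRR booleans may be written as T/F or true/false; accept both spellings."""
    return str(value).strip().lower() in {"t", "true"}


def filter_clones(rows):
    """Return (kept_rows, stats); stats are counted before any row is dropped."""
    rows = list(rows)
    stats = {
        "total": len(rows),
        "productive": sum(_is_true(r.get("productive")) for r in rows),
    }
    stats["nonproductive"] = stats["total"] - stats["productive"]
    # Keep only productive clones that also have a called V gene.
    kept = [
        r for r in rows
        if _is_true(r.get("productive")) and (r.get("v_call") or "").strip()
    ]
    return kept, stats


# Hypothetical three-clone repertoire: one clean, one without a V call,
# one nonproductive.
example_tsv = (
    "junction_aa\tv_call\tproductive\n"
    "CASSLGQETQYF\tTRBV7-9\tT\n"
    "CASSPGTGYEQYF\t\tT\n"
    "CASS*GQETQYF\tTRBV5-1\tF\n"
)
reader = csv.DictReader(io.StringIO(example_tsv), delimiter="\t")
kept, stats = filter_clones(reader)
print(stats)
print([r["junction_aa"] for r in kept])
```

Computing `stats` before building `kept` is what lets the pre-filter totals travel as a sidecar to a later step, which only ever sees the filtered rows.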
Implementing bugfixes from commit 389a328:
1. bin/sample_calc.py: Gene family CSVs (v_family, d_family, j_family) were built with hard-coded maximum indices (TRBV: 30, TRBD: 2, TRBJ: 2), silently dropping any gene with a number above those limits. The max index is now derived dynamically from genes observed in each sample.
   - SAMPLE_CALC now outputs genes and counts in long form, which is then collected and pivoted wide in SAMPLE_CALC_PIVOT.
3. bin/pseudobulk.py: Cell Ranger AIRR output was not filtered for cell or contig quality before pseudobulking. is_cell, high_confidence, and productive filters are now applied in both pseudobulk() and pseudobulk_phenotype() when those columns are present, ensuring background barcodes, low-confidence assemblies, and non-productive contigs are excluded from single-cell input.
Co-Authored-By: KevinMLanderos <kevinmezalanderos@gmail.com>
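To illustrate why deriving families from the observed genes avoids the silent drop, here is a minimal sketch of long-form gene family counting. It is not the code in bin/sample_calc.py: `gene_family`, `family_counts_long`, and the regular expression are assumptions made for this example, and the example calls are made up.

```python
import re
from collections import Counter

# Locus prefix (e.g. TRBV) followed by the family number; assumed call format.
FAMILY_RE = re.compile(r"^(TR[ABDG][VDJ])(\d+)")


def gene_family(call):
    """Map a call such as 'TRBV7-9*01' to its family 'TRBV7'; None if unparseable."""
    first = call.split(",")[0].strip()  # use the first call when several are listed
    m = FAMILY_RE.match(first)
    return f"{m.group(1)}{int(m.group(2))}" if m else None


def family_counts_long(sample, calls):
    """Return (sample, family, count) rows for every family actually observed."""
    counts = Counter(f for f in map(gene_family, calls) if f)
    return [(sample, fam, n) for fam, n in sorted(counts.items())]


# A family numbered 31 lies outside a fixed 1..30 range and would have been
# dropped by a hard-coded maximum; here it is counted like any other.
rows = family_counts_long(
    "s1", ["TRBV7-9*01", "TRBV31*01", "TRBV7-2*01", "TRBJ2-7*01"]
)
print(rows)
```

Because each sample only emits the families it contains, the per-sample outputs have different row sets, which is why a separate collect-and-pivot step is needed to build the wide table.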
- With the refactoring of compare_calc.py, there are no tasks that read the file paths from the input samplesheet directly, which required an intermediate localization/resolving step in the case of s3 on Batch. All inputs now use sample_map.
merge from main
…it-Bulk into dltamayo-dev
Co-Authored-By: dimalvovs <dmitrijs.lvovs@gmail.com>
- Mostly replacing .set {} with def =
Using params.yml to supply parameters for strict syntax, instead of CLI
- With refactoring of gliph2 to patient subworkflow, compute resource needs should be alleviated for smaller datasets
- Making name of metadata field indicating subject-level grouping of samples consistent with patient workflow
Unit Test Results: 10 tests, 10 ✅, 3m 14s ⏱️. Results for commit 92486a3. ♻️ This comment has been updated with latest results.
Pull request overview
This PR migrates the TCRtoolkit Nextflow pipeline to Nextflow 26.04.3 / strict-parser expectations while refactoring several subworkflows and updating metadata conventions (notably subject_id → patient). It also updates the nf-schema plugin/schema and streamlines sample-level aggregation logic.
Changes:
- Refactors workflow wiring to pass annotated outputs downstream (including new per-sample pre-filter stats).
- Renames samplesheet metadata field from `subject_id` to `patient` across tests/fixtures/config/schema.
- Updates config/schema for strict parsing and adds a `params.yml` example for running the pipeline.
Reviewed changes
Copilot reviewed 37 out of 41 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| workflows/tcrtoolkit.nf | Updates workflow wiring (ANNOTATE → SAMPLE/PATIENT/COMPARE) and removes old samplesheet resolution path. |
| subworkflows/local/input_check.nf | Refactors INPUT_CHECK outputs (currently introduces invalid emit: placement). |
| subworkflows/local/annotate.nf | Emits new per_sample_stats and refactors intermediate channels. |
| subworkflows/local/sample.nf | Joins annotated samples with per-sample stats, replaces custom aggregation with collectFile + new pivot process. |
| modules/local/annotate/main.nf | Adds pre_filter_stats sidecar output and filters clones to productive. |
| modules/local/sample/sample_calc.nf | Extends SAMPLE_CALC inputs and introduces SAMPLE_CALC_PIVOT for long→wide gene-family tables. |
| bin/sample_calc.py | Switches gene-family output to long format; consumes pre-filter stats sidecar. |
| subworkflows/local/pseudobulk_phenotype.nf | Refactors aggregation for phenotype pseudobulk workflow (currently incompatible with new SAMPLE_CALC signature). |
| subworkflows/local/convert/* | Refactors CONVERT wiring and subworkflow emits (currently mismatched output naming). |
| modules/local/compare/tcrsharing.nf | Updates sharing calc (currently uses wrong key column name for current concat format). |
| modules/local/olga/main.nf | Renames pgen key header to junction_aa and updates index loading accordingly. |
| nextflow.config | Updates nf-schema plugin version, strict-typed resource params, config include order, and defaults (incl. use_gliph2). |
| nextflow_schema.json | Updates schema types/defaults and adds OLGA options + use_gliph2 default. |
| conf/base.config | Simplifies resource handling and switches to resourceLimits. |
| README.md / params.yml | Documents strict-syntax usage via -params-file and adds a minimal params file. |
| tests/* / fixtures/* / .cirro/process-form.json | Updates samplesheet headers and tests for patient field. |
| subworkflows/local/validate_params.nf | Disables paramsSummaryLog with a version-reference comment. |
publishDir and container are more appropriate for base.config (globally for all processes), rather than modules.config (overrides of defaults at a module level)
- update NF version naming in comment
- revert new File() back to file() to allow for URLs to be submitted
This was referenced Jun 17, 2026
Closed
Closed
This pull request introduces several significant updates and refactorings to the TCRtoolkit Nextflow pipeline, focusing on simplifying configuration, improving compatibility with Nextflow's strict syntax, standardizing column naming, and enhancing sample and gene family aggregation processes. The changes also include code clean-up by removing obsolete processes and improving resource management.
Key changes:
Configuration and Resource Management
- `conf/base.config` no longer uses the `check_max` resource limiting function; it now uses direct assignments and the `resourceLimits` directive for process resources, simplifying configuration and aligning with modern Nextflow best practices. The pipeline name in the config header was also corrected.
- `nextflow.config` now uses the latest `nf-schema` plugin version and no longer explicitly includes `base.config`, relying on standard Nextflow config loading.

Parameter Handling and Documentation
- `README.md` now explains the new requirement to supply non-default parameters via `params.yml` due to Nextflow strict syntax, replacing command-line parameter examples accordingly.

Data Standardization and Processing
- `CDR3b` is renamed to `junction_aa` (and similarly for other columns) in OLGA, TCRsharing, and related modules, ensuring consistency and compatibility.
- `ANNOTATE_PROCESS` now outputs additional pre-filter statistics, includes more columns in its output, and filters for productive sequences, improving downstream data quality and reporting.

Sample and Gene Family Aggregation
- A `SAMPLE_CALC_PIVOT` process pivots gene family counts into wide format CSVs, with custom gene sorting for clarity and downstream analysis.
- `SAMPLE_CALC` now accepts and uses pre-filter statistics as input, supporting improved sample-level calculations.
- The `SAMPLE_AGGREGATE` and `SAMPLESHEET_RESOLVE` processes are removed, streamlining the codebase.

Miscellaneous Improvements
- … `.cirro/process-form.json` for clarity.
- … (`TCRPHENO`, `SAMPLESHEET_PHENO`).
- `container` directives are removed from certain process definitions, relying on global configuration.
- … `main.nf`.

These changes collectively modernize the pipeline, improve reproducibility, and make the codebase easier to maintain and extend.
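The pivot of gene family counts into wide format with custom gene sorting can be sketched as follows. This is a hedged illustration rather than the SAMPLE_CALC_PIVOT implementation: `natural_key` and `pivot_wide` are hypothetical names, the long rows are made-up `(sample, gene, count)` tuples, and "custom gene sorting" is assumed here to mean ordering by the numeric part of the family name.

```python
import re


def natural_key(gene):
    """Sort 'TRBV2' before 'TRBV10' by comparing the numeric suffix as an integer."""
    m = re.match(r"^(\D+)(\d+)$", gene)
    return (m.group(1), int(m.group(2))) if m else (gene, 0)


def pivot_wide(long_rows):
    """Turn (sample, gene, count) rows into a header plus one row per sample."""
    genes = sorted({g for _, g, _ in long_rows}, key=natural_key)
    samples = sorted({s for s, _, _ in long_rows})
    # Genes missing from a sample are filled with 0 rather than left blank.
    table = {s: dict.fromkeys(genes, 0) for s in samples}
    for s, g, n in long_rows:
        table[s][g] += n
    header = ["sample"] + genes
    body = [[s] + [table[s][g] for g in genes] for s in samples]
    return header, body


long_rows = [
    ("s1", "TRBV10", 4),
    ("s1", "TRBV2", 1),
    ("s2", "TRBV2", 3),
    ("s2", "TRBV31", 2),
]
header, body = pivot_wide(long_rows)
print(header)
for row in body:
    print(row)
```

The column set is the union of genes seen across all samples, so no maximum index has to be fixed in advance.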