Nextflow 26.04.3 and strict parser migration, other refactoring by dltamayo · Pull Request #86 · KarchinLab/TCRtoolkit

dltamayo · 2026-06-17T14:58:12Z

This pull request introduces several significant updates and refactorings to the TCRtoolkit Nextflow pipeline, focusing on simplifying configuration, improving compatibility with Nextflow's strict syntax, standardizing column naming, and enhancing sample and gene family aggregation processes. The changes also include code clean-up by removing obsolete processes and improving resource management.

Key changes:

Configuration and Resource Management

Updated conf/base.config to remove the check_max resource limiting function and instead use direct assignments and the resourceLimits directive for process resources, simplifying configuration and aligning with modern Nextflow best practices. The pipeline name in the config header was also corrected.
Updated nextflow.config to use the latest nf-schema plugin version and removed the explicit inclusion of base.config, relying on standard Nextflow config loading.

Parameter Handling and Documentation

Updated the README.md to explain the new requirement to supply non-default parameters via params.yml due to Nextflow strict syntax, replacing command-line parameter examples accordingly.

Data Standardization and Processing

Standardized column names across the pipeline from CDR3b to junction_aa (and similar for other columns) in OLGA, TCRsharing, and related modules, ensuring consistency and compatibility.
Updated the ANNOTATE_PROCESS to output additional pre-filter statistics, include more columns in output, and filter for productive sequences, improving downstream data quality and reporting.
Modified GIANA and GLIPH2 clustering steps to standardize input column names and ensure compatibility with external tools.
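
The "rename only at the tool boundary" idea can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's code: the helper `rename_for_tool` and the mapping `AIRR_TO_TOOL` are hypothetical, and the tool-side header names are assumptions that would need to be checked against what GIANA and GLIPH2 actually expect.

```python
# Hypothetical mapping from AIRR column names to a tool's expected headers.
# The right-hand names are illustrative assumptions, not confirmed tool inputs.
AIRR_TO_TOOL = {"junction_aa": "CDR3b", "v_call": "TRBV", "j_call": "TRBJ"}


def rename_for_tool(rows, mapping=AIRR_TO_TOOL):
    """Return copies of AIRR-style rows with only the mapped keys renamed."""
    return [
        {mapping.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


# Made-up example record using AIRR column names.
airr_rows = [
    {
        "junction_aa": "CASSLGQETQYF",
        "v_call": "TRBV7-9",
        "j_call": "TRBJ2-5",
        "duplicate_count": 12,
    }
]

# The renamed copy is what would be written out for the external tool;
# the AIRR-named rows stay untouched for the rest of the pipeline.
tool_rows = rename_for_tool(airr_rows)
```

Keeping the rename in one small function per tool means the rest of the pipeline never sees tool-specific headers.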

Sample and Gene Family Aggregation

Added a new SAMPLE_CALC_PIVOT process to pivot gene family counts into wide format CSVs, with custom gene sorting for clarity and downstream analysis.
Enhanced SAMPLE_CALC to accept and utilize pre-filter statistics as input, supporting improved sample-level calculations.
Removed obsolete SAMPLE_AGGREGATE and SAMPLESHEET_RESOLVE processes, streamlining the codebase.
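
The long-to-wide pivot with gene sorting described above could look roughly like the following. This is a sketch under stated assumptions rather than the SAMPLE_CALC_PIVOT implementation: the function names `gene_sort_key` and `pivot_wide`, the `(sample, gene, count)` record layout, and the example values are all hypothetical. The columns are taken from the genes actually present in the input, so nothing is dropped by a fixed index range.

```python
import re
from collections import defaultdict


def gene_sort_key(gene):
    """Natural sort key: 'TRBV10-3' becomes ['TRBV', 10, '-', 3, ''] so 2 sorts before 10."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", gene)]


def pivot_wide(long_rows):
    """Pivot (sample, gene, count) records into a header plus one wide row per sample."""
    counts = defaultdict(dict)
    for sample, gene, count in long_rows:
        counts[sample][gene] = counts[sample].get(gene, 0) + count
    # Columns are derived from the observed genes, not from a hard-coded maximum.
    genes = sorted(
        {gene for per_sample in counts.values() for gene in per_sample},
        key=gene_sort_key,
    )
    header = ["sample"] + genes
    table = [
        [sample] + [per_sample.get(gene, 0) for gene in genes]
        for sample, per_sample in sorted(counts.items())
    ]
    return header, table


# Made-up long-form input; gene names and counts are illustrative only.
long_rows = [("s1", "TRBV10-3", 4), ("s1", "TRBV2", 1), ("s2", "TRBV31", 7)]
header, table = pivot_wide(long_rows)
```

Genes missing from a sample are filled with 0 so every row has the same width.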

Miscellaneous Improvements

Updated the default subject column for V gene plots in .cirro/process-form.json for clarity.
Improved R and Python scripts in various modules for better data handling, consistency, and output formatting (e.g., in TCRPHENO, SAMPLESHEET_PHENO).
Removed redundant container directives from certain process definitions, relying on global configuration.
Removed an unnecessary DSL version specification from main.nf.

These changes collectively modernize the pipeline, improve reproducibility, and make the codebase easier to maintain and extend.


- Filtering occurs at ANNOTATE:ANNOTATE_PROCESS, upstream of the Sample/Compare workflows. Only clones with productive CDR3 sequences and called V genes are kept; nonproductive clones and clones with uncalled V genes are filtered out. All downstream tools and analyses work from the filtered repertoires.
- Total productive and nonproductive clone counts are calculated prior to filtering and passed to sample_calc.py.
- Bug fix for broken tcrpheno output files, plus removal of the filtering step in the R script.
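
The ordering described above (count first, then filter) can be sketched as follows. This is a minimal illustration and not the ANNOTATE_PROCESS code: the function names, the truthy encodings accepted for `productive`, and the choice to treat an empty `v_call` as "uncalled" are assumptions made for the example.

```python
def is_productive(row):
    """Assumed encodings for a productive rearrangement: T, true, or 1."""
    return str(row.get("productive", "")).strip().lower() in {"t", "true", "1"}


def has_v_call(row):
    """Assumption: an empty or missing v_call means the V gene was not called."""
    return bool(str(row.get("v_call") or "").strip())


def filter_repertoire(rows):
    """Compute pre-filter stats, then keep productive clones with a called V gene."""
    stats = {
        "total": len(rows),
        "productive": sum(is_productive(row) for row in rows),
    }
    stats["nonproductive"] = stats["total"] - stats["productive"]
    kept = [row for row in rows if is_productive(row) and has_v_call(row)]
    return kept, stats


# Made-up clones; sequences and gene names are illustrative only.
clones = [
    {"junction_aa": "CASSLGQETQYF", "v_call": "TRBV7-9", "productive": "T"},
    {"junction_aa": "CASS*GQETQYF", "v_call": "TRBV5-1", "productive": "F"},
    {"junction_aa": "CASSPGTEAFF", "v_call": "", "productive": "T"},
]
kept, pre_filter_stats = filter_repertoire(clones)
```

Because the stats are computed before any rows are dropped, a downstream step can still report the nonproductive count even though it only receives the filtered repertoire.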

Implementing bugfixes from commit 389a328:

1. bin/sample_calc.py: Gene family CSVs (v_family, d_family, j_family) were built with hard-coded maximum indices (TRBV: 30, TRBD: 2, TRBJ: 2), silently dropping any gene with a number above those limits. The max index is now derived dynamically from genes observed in each sample.
   - SAMPLE_CALC now outputs genes and counts in long form, which is then collected and pivoted wide in SAMPLE_CALC_PIVOT.
3. bin/pseudobulk.py: Cell Ranger AIRR output was not filtered for cell or contig quality before pseudobulking. is_cell, high_confidence, and productive filters are now applied in both pseudobulk() and pseudobulk_phenotype() when those columns are present, ensuring background barcodes, low-confidence assemblies, and non-productive contigs are excluded from single-cell input.

Co-Authored-By: KevinMLanderos <kevinmezalanderos@gmail.com>
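
The "apply each filter only when its column is present" behaviour from item 3 can be illustrated with a short sketch. It is not the code in bin/pseudobulk.py: the helper names, the accepted truthy encodings, and the assumption that all rows share the header of the first row are choices made for this example.

```python
QUALITY_COLUMNS = ("is_cell", "high_confidence", "productive")


def _truthy(value):
    """Assumed encodings for a passing flag: T, true, or 1."""
    return str(value).strip().lower() in {"t", "true", "1"}


def apply_quality_filters(rows, columns=QUALITY_COLUMNS):
    """Keep rows that pass every quality flag present in the table."""
    if not rows:
        return []
    # Assumption: rows come from one table, so the first row's keys act as the header.
    present = [column for column in columns if column in rows[0]]
    return [row for row in rows if all(_truthy(row[column]) for column in present)]


# Made-up contigs; identifiers and flags are illustrative only.
contigs = [
    {"cell_id": "A", "is_cell": "T", "high_confidence": "T", "productive": "T"},
    {"cell_id": "B", "is_cell": "F", "high_confidence": "T", "productive": "T"},
    {"cell_id": "C", "is_cell": "T", "high_confidence": "T", "productive": "F"},
]
passing = apply_quality_filters(contigs)

# A table without any of the quality columns passes through unchanged.
unflagged = apply_quality_filters([{"cell_id": "D"}])
```

Checking for column presence keeps the same function usable for inputs that lack these flags, at the cost of silently skipping a filter when a column is missing.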


github-actions · 2026-06-17T15:01:38Z

Unit Test Results

10 tests, 2 suites, 1 file: 10 passed ✅, 0 skipped 💤, 0 failed ❌ in 3m 14s ⏱️

Results for commit 92486a3.

♻️ This comment has been updated with latest results.

Copilot

Pull request overview

This PR migrates the TCRtoolkit Nextflow pipeline to Nextflow 26.04.3 / strict-parser expectations while refactoring several subworkflows and updating metadata conventions (notably subject_id → patient). It also updates the nf-schema plugin/schema and streamlines sample-level aggregation logic.

Changes:

Refactors workflow wiring to pass annotated outputs downstream (including new per-sample pre-filter stats).
Renames samplesheet metadata field from subject_id to patient across tests/fixtures/config/schema.
Updates config/schema for strict parsing and adds a params.yml example for running the pipeline.

Reviewed changes

Copilot reviewed 37 out of 41 changed files in this pull request and generated 11 comments.


File	Description
workflows/tcrtoolkit.nf	Updates workflow wiring (ANNOTATE → SAMPLE/PATIENT/COMPARE) and removes old samplesheet resolution path.
subworkflows/local/input_check.nf	Refactors INPUT_CHECK outputs (currently introduces invalid `emit:` placement).
subworkflows/local/annotate.nf	Emits new `per_sample_stats` and refactors intermediate channels.
subworkflows/local/sample.nf	Joins annotated samples with per-sample stats, replaces custom aggregation with `collectFile` + new pivot process.
modules/local/annotate/main.nf	Adds `pre_filter_stats` sidecar output and filters clones to productive.
modules/local/sample/sample_calc.nf	Extends SAMPLE_CALC inputs and introduces SAMPLE_CALC_PIVOT for long→wide gene-family tables.
bin/sample_calc.py	Switches gene-family output to long format; consumes pre-filter stats sidecar.
subworkflows/local/pseudobulk_phenotype.nf	Refactors aggregation for phenotype pseudobulk workflow (currently incompatible with new SAMPLE_CALC signature).
subworkflows/local/convert/*	Refactors CONVERT wiring and subworkflow emits (currently mismatched output naming).
modules/local/compare/tcrsharing.nf	Updates sharing calc (currently uses wrong key column name for current concat format).
modules/local/olga/main.nf	Renames pgen key header to `junction_aa` and updates index loading accordingly.
nextflow.config	Updates nf-schema plugin version, strict-typed resource params, config include order, and defaults (incl. `use_gliph2`).
nextflow_schema.json	Updates schema types/defaults and adds OLGA options + use_gliph2 default.
conf/base.config	Simplifies resource handling and switches to `resourceLimits`.
README.md / params.yml	Documents strict-syntax usage via `-params-file` and adds a minimal params file.
tests/* / fixtures/* / .cirro/process-form.json	Updates samplesheet headers and tests for `patient` field.
subworkflows/local/validate_params.nf	Disables paramsSummaryLog with a version-reference comment.


dltamayo and others added 15 commits May 26, 2026 12:38

Refactor sample_calc

a6c98af

- replace sample_agg with collectFile()

Standardize column naming

19be958

- Use AIRR format throughout pipeline - Only rename columns for tools that require specific column names for input (giana, gliph)

Remove resolve_samplesheet

d8b9801

- With the refactoring of compare_calc.py, there are no tasks that read the file paths from the input samplesheet directly, which required an intermediate localization/resolving step in the case of s3 on Batch. All inputs now use sample_map.

Merge pull request #85 from KarchinLab/main

acadc3e

merge from main

Merge branch 'dltamayo-dev' of https://github.com/KarchinLab/TCRtoolk…

4cd53e7

…it-Bulk into dltamayo-dev

recycle check_max

3163fde

Co-Authored-By: dimalvovs <dmitrijs.lvovs@gmail.com>

Update linting for Nextflow 26.4

48dfdad

Remove container

2d8aa2a

Update schema

6453e11

Init Syntax v2 migration

a0b44e1

- Mostly replacing .set {} with def =

Update README and init params.yml

5b72235

Using params.yml to supply parameters for strict syntax, instead of CLI

Enable gliph2 by default

41feefc

- With refactoring of gliph2 to patient subworkflow, compute resource needs should be alleviated for smaller datasets

Replace subject_id with patient

2f50a68

- Making name of metadata field indicating subject-level grouping of samples consistent with patient workflow

Update test samplesheet

2d00794

dimalvovs requested a review from Copilot June 17, 2026 16:19

Copilot started reviewing on behalf of dimalvovs June 17, 2026 16:20

Copilot AI reviewed Jun 17, 2026


dltamayo added 2 commits June 17, 2026 13:21

Update configs

184d9e1

publishDir and container are more appropriate for base.config (globally for all processes), rather than modules.config (overrides of defaults at a module level)

Implement Copilot suggestions

92486a3

- update NF version naming in comment - revert new File() back to file() to allow for URLs to be submitted

dltamayo merged commit e931159 into main Jun 17, 2026
3 checks passed

This was referenced Jun 17, 2026

fix: filter non-productive TCRs, dynamic gene family columns, single-… #82

Closed

strict parser update #84

Closed

add pseudobulk tests #83

Closed

dltamayo merged 18 commits into main from dltamayo-dev
