RedCEA is a pipeline for comparing immune repertoires using prototype-based TCR embeddings. It builds on the original TCRemP embedding method and adds denoising, clustering, and enrichment analysis for case/background repertoire comparisons.
This repository contains command-line tools for:
- computing prototype-based embeddings for a repertoire with `tcremp-run`
- clustering embeddings with `tcremp-cluster`
- running the end-to-end comparison pipeline with `redcea`
Prepare a clean Linux server with:
- `git`
- `conda`, such as Miniconda or Mambaforge
- internet access for Python package installation
- access to GitHub, because the dependency `mir` is installed from a git URL
Recommended Python version: 3.11.
git clone https://gitlab.aldan3.itm-rsmu.ru/isagroup/redcea.git
cd redcea
conda create -n redcea python=3.11 -y
conda activate redcea
python -m pip install --upgrade pip setuptools wheel
python -m pip install -e .[test]

This installs:
- the `redcea` package in editable mode
- the RedCEA CLI entry points `redcea` and `tcrempnet`
- the `tcremp` dependency, which provides `tcremp-run` and `tcremp-cluster`
- test dependencies including `pytest`
Notes:
- `mir` is installed from https://github.com/antigenomics/mirpy.git
- default clustering for `redcea` is `vdbscan`
- optional Leiden-based clustering requires an extra dependency
If you plan to run --cluster-algo leiden, hierarchical_leiden, or leiden_dbscan, install the optional Leiden dependency:
python -m pip install .[leiden]

If this optional install fails, you can still run the default vdbscan pipeline.
redcea supports several clustering backends:
- `vdbscan`: default RedCEA mode with per-group `eps` estimation on the joint sample/background graph
- `dbscan`: legacy TCRempNet mode on reduced embeddings, with pre-filtering of points with `d1 > eps` before running plain `DBSCAN`
- `leiden`: graph clustering on the joint KNN graph
- `hierarchical_leiden`: two-stage Leiden clustering
- `leiden_dbscan`: Leiden followed by per-cluster DBSCAN refinement
Use dbscan if you need behavior closer to historical TCRempNet runs and want the old epsilon-based noise pre-filter back.
Run the following commands in the activated environment:
python -c "import redcea, tcremp, mir; print('imports: OK')"
redcea --help
tcremp-run --help
pytest -q

Expected result:
- imports succeed without `ModuleNotFoundError`
- CLI help is printed for the requested commands
- tests pass
Important limitation:
- the test suite validates parser logic, helper functions, and a mocked pipeline smoke test
- it does not replace a real run on a small dataset in your target environment
For a fresh server, use this validation sequence:
- install the package with `python -m pip install -e .[test]`
- run `redcea --help`
- run `pytest -q`
- run one small real dataset through `redcea` or `tcremp-run`
- confirm that expected output files are created and the log ends without runtime errors
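The import and CLI checks in this sequence can also be scripted. The following is a minimal sketch, not part of RedCEA: it uses only the Python standard library, takes the module names (`redcea`, `tcremp`, `mir`) and command names from the installation notes in this README, and introduces a hypothetical helper, `find_missing`, purely for illustration.

```python
import importlib.util
import shutil

# Module and command names are taken from the installation notes in this README.
MODULES = ("redcea", "tcremp", "mir")
COMMANDS = ("redcea", "tcremp-run", "tcremp-cluster")


def find_missing(modules=MODULES, commands=COMMANDS):
    """Return (missing_modules, missing_commands) for the current environment."""
    # find_spec returns None for a top-level module that cannot be located.
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when a command is not on PATH.
    missing_commands = [c for c in commands if shutil.which(c) is None]
    return missing_modules, missing_commands


if __name__ == "__main__":
    mods, cmds = find_missing()
    print("missing modules:", mods or "none")
    print("missing commands:", cmds or "none")
```

Run it inside the activated `redcea` environment. It only confirms that packages and entry points are discoverable, so it does not replace `pytest -q` or a real run on a small dataset.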
If you plan to compare multiple case samples against the same background repertoire, compute the background embeddings once with tcremp-run and reuse them in downstream redcea runs.
tcremp-run \
--input /projects/immunestatus/airr_format/sample.tsv \
--output ./results \
--chain TRB \
-np 48

This produces embedding outputs in ./results.
For large samples, embedding can take hours and requires substantial CPU and memory.
redcea \
-is /projects/immunestatus/airr_format/sample.tsv \
-ib /projects/immunestatus/airr_format/background.tsv \
-c TRB \
-o ./results \
-np 4 \
-se ./results/sample_tcremp.parquet \
-be ./results/background_tcremp.parquet

Use this mode when the embedding files already exist and you want to skip recomputation.
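When several case samples share one background, the reuse pattern described above can be scripted. The sketch below only builds the command lines and does not run them. It uses the `redcea` flags documented in this README and assumes that `-be` may be supplied without `-se`, which the option table suggests but does not state outright. The helper `build_redcea_commands`, the per-sample output layout, and the file names are hypothetical.

```python
import shlex
from pathlib import Path


def build_redcea_commands(samples, background, background_embedding,
                          out_root, chain="TRB", nproc=4):
    """Return one redcea argument list per case sample, sharing one background."""
    commands = []
    for sample in samples:
        sample = Path(sample)
        commands.append([
            "redcea",
            "-is", str(sample),
            "-ib", str(background),
            "-c", chain,
            "-o", str(Path(out_root) / sample.stem),  # one output directory per sample (assumed layout)
            "-np", str(nproc),
            "-be", str(background_embedding),  # reuse precomputed background embeddings
        ])
    return commands


if __name__ == "__main__":
    cmds = build_redcea_commands(
        samples=["case_a.tsv", "case_b.tsv"],  # hypothetical file names
        background="background.tsv",
        background_embedding="./results/background_tcremp.parquet",
        out_root="./results",
    )
    for cmd in cmds:
        print(shlex.join(cmd))  # to execute, pass cmd to subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to review them, or paste them into a job script, before anything expensive starts.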
redcea \
-is sample.tsv \
-ib background.tsv \
-c TRB \
-o ./results \
-np 8

In this mode, embeddings for both sample and background are computed automatically if they are not already available.
| CLI Tool | Description |
|---|---|
| `tcremp-run` | Computes TCRemP embeddings and optional clustering |
| `redcea` | Runs embedding, clustering, and enrichment |
| `tcremp-cluster` | Clusters existing embeddings |
redcea \
--sample /projects/immunestatus/pogorelyy/airr_format/yfv_day_15.txt \
--background /projects/immunestatus/pogorelyy/airr_format/yfv_day_0.txt \
--output /projects/immunestatus/pogorelyy/redcea/yfv_res \
--chain TRB \
--prefix yfv_result \
-np 16

Depending on the mode, the pipeline may create:
| File Name | Description |
|---|---|
| `*_sample_embeddings.parquet` | Sample embeddings produced or reused by redcea |
| `*_background_embeddings.parquet` | Background embeddings produced or reused by redcea |
| `*_tcremp_clusters.tsv` | Cluster assignments for both sample and background clonotypes |
| `*_summary_tcrempnet.tsv` | Per-cluster summary with counts, p-values, FDR, and log fold change |
| `*_enriched_clonotypes_tcremp.tsv` | Clonotypes from enriched clusters |
| `*_enriched_embeddings_tcremp.parquet` | Embeddings of enriched clonotypes with cluster metadata |
| `*.log` | Run log for debugging and runtime tracking |
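As an illustration of how the per-cluster summary might be consumed downstream, the sketch below reads a `*_summary_tcrempnet.tsv` file and lists clusters that pass an FDR cutoff. It relies on the column names `cluster_id`, `enrichment_fdr_zbinom`, and `log_fold_change` given in this README. The 0.05 threshold, the reading of a positive log fold change as enrichment in the sample, and the helper name `enriched_clusters` are assumptions made for the example, not documented RedCEA behavior.

```python
import csv


def enriched_clusters(summary_path, fdr_threshold=0.05):
    """Return cluster IDs with FDR below the threshold and positive log fold change."""
    selected = []
    with open(summary_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                fdr = float(row["enrichment_fdr_zbinom"])
                lfc = float(row["log_fold_change"])
            except (TypeError, ValueError):
                continue  # skip rows with missing or non-numeric values
            # Assumption: positive log fold change means enrichment in the sample.
            if fdr < fdr_threshold and lfc > 0:
                selected.append(row["cluster_id"])
    return selected
```

Cluster IDs are returned as strings exactly as they appear in the file, so they can be matched against `*_tcremp_clusters.tsv` without type conversion.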
Treat the run as successful only if all of the following are true:
- the output directory exists
- `*_sample_embeddings.parquet` and `*_background_embeddings.parquet` exist or were intentionally supplied as inputs
- `*_tcremp_clusters.tsv` exists
- `*_summary_tcrempnet.tsv` exists and contains `cluster_id`, `cluster_size`, `sample`, `background`, `enrichment_fdr_zbinom`, and `log_fold_change`
- the log file ends with `TCRempNet pipeline completed.`
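The checklist above can be turned into a small post-run check. This is a sketch under stated assumptions rather than a RedCEA utility: it assumes the `*.log` file is written into the output directory, and it accepts a final log line that merely ends with the completion message, in case the logger adds a timestamp prefix. The helper `check_run` and its `embeddings_supplied` switch are hypothetical.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"cluster_id", "cluster_size", "sample", "background",
                    "enrichment_fdr_zbinom", "log_fold_change"}
FINAL_LOG_LINE = "TCRempNet pipeline completed."


def check_run(out_dir, embeddings_supplied=False):
    """Return a list of problems found; an empty list means every check passed."""
    out = Path(out_dir)
    if not out.is_dir():
        return [f"output directory missing: {out}"]
    problems = []
    patterns = ["*_tcremp_clusters.tsv", "*_summary_tcrempnet.tsv", "*.log"]
    if not embeddings_supplied:
        # Embedding files are only required when they were not passed in as inputs.
        patterns += ["*_sample_embeddings.parquet", "*_background_embeddings.parquet"]
    for pattern in patterns:
        if not list(out.glob(pattern)):
            problems.append(f"no file matching {pattern}")
    for summary in out.glob("*_summary_tcrempnet.tsv"):
        with open(summary, newline="") as handle:
            header = next(csv.reader(handle, delimiter="\t"), [])
        missing = REQUIRED_COLUMNS - set(header)
        if missing:
            problems.append(f"{summary.name} lacks columns: {sorted(missing)}")
    for log in out.glob("*.log"):
        lines = [ln.strip() for ln in log.read_text().splitlines() if ln.strip()]
        # Assumption: the last non-empty line carries the completion message.
        if not lines or not lines[-1].endswith(FINAL_LOG_LINE):
            problems.append(f"{log.name} does not end with the completion message")
    return problems
```

Call `check_run("./results")` after a job finishes, or `check_run("./results", embeddings_supplied=True)` when `-se` and `-be` pointed at files stored elsewhere.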
The pipeline expects repertoire tables that can be parsed by the underlying tcremp AIRR-loading utilities.
Before running on a clean server, verify on one small file that:
- the file path is correct and readable by the current user
- the repertoire contains the requested chain: `TRA`, `TRB`, or `TRA_TRB`
- required CDR3 and V/J fields expected by `tcremp` are present
- the file is not empty after filtering by chain and CDR3 length
If a run fails at startup, first check file format compatibility and chain selection.
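A pre-flight check along these lines can catch most startup failures before a long job is queued. The sketch below is illustrative only. It assumes a tab-separated file with AIRR Rearrangement column names (`junction_aa`, `v_call`, `j_call`), which may differ from what the `tcremp` loader actually expects, so adjust the column arguments to your data. It infers the chain from the V gene prefix, so it covers single-chain input (`TRA` or `TRB`) but not paired `TRA_TRB`. The length bounds mirror the documented defaults of `--lower-len-cdr3` and `--higher-len-cdr3`, and treating them as inclusive is an assumption.

```python
import csv
import os


def preflight(path, chain="TRB", min_len=5, max_len=30,
              cdr3_col="junction_aa", v_col="v_call", j_col="j_call"):
    """Return (ok, message) after basic readability, column, and filter checks."""
    if not os.path.isfile(path) or not os.access(path, os.R_OK):
        return False, "file is missing or not readable"
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        columns = set(reader.fieldnames or [])
        missing = {cdr3_col, v_col, j_col} - columns
        if missing:
            return False, f"missing columns: {sorted(missing)}"
        kept = 0
        for row in reader:
            cdr3 = row[cdr3_col] or ""
            v_gene = row[v_col] or ""
            # Assumption: the chain is recognizable from the V gene prefix, e.g. TRBV7-2*01.
            if v_gene.startswith(chain) and min_len <= len(cdr3) <= max_len:
                kept += 1
    if kept == 0:
        return False, "no rows left after chain and CDR3 length filtering"
    return True, f"{kept} rows pass the filters"
```

For example, `preflight("sample.tsv", chain="TRB")` returns a pass or fail flag and a short reason that can be logged before the job is submitted.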
Activate the redcea environment before submitting the job.
#!/bin/sh
#SBATCH --job-name=redcea
#SBATCH --cpus-per-task=48
#SBATCH --mem=128gb
#SBATCH --time=08:00:00
#SBATCH --output=redcea_run.%j.log
redcea \
-is case.tsv \
-ib control.tsv \
-c TRB \
-o ./results \
-np 48

tcremp-run \
--input case.tsv \
--output ./results \
--chain TRB \
-np 32

| Short | Long | Required | Default | Description |
|---|---|---|---|---|
| `-is` | `--sample` | Yes | none | Path to the sample repertoire table |
| `-ib` | `--background` | Yes | none | Path to the background repertoire table |
| `-o` | `--output` | Yes | none | Output directory |
| `-e` | `--prefix` | No | input filename | Output prefix |
| `-x` | `--index-col` | No | none | Optional input ID column to preserve in outputs |
| `-c` | `--chain` | Yes | none | `TRA`, `TRB`, or `TRA_TRB` |
| `-p` | `--prototypes-path` | No | package defaults | Path to a user-supplied prototypes file |
| `-n` | `--n-prototypes` | No | all available | Number of prototypes used for embedding |
| none | `--sample-random-prototypes` | No | `False` | Sample prototypes randomly |
| `-nc` | `--n-clonotypes` | No | all available | Number of clonotypes to process |
| none | `--sample-random-clonotypes` | No | `False` | Sample clonotypes randomly |
| `-s` | `--species` | No | `HomoSapiens` | Species for V/J gene alignment |
| `-u` | `--unique-clonotypes` | No | `False` | Use only unique clonotypes |
| `-r` | `--random-seed` | No | `42` | Random seed |
| `-np` | `--nproc` | No | `1` | Number of worker processes |
| `-llen` | `--lower-len-cdr3` | No | `5` | Minimum CDR3 length |
| `-hlen` | `--higher-len-cdr3` | No | `30` | Maximum CDR3 length |
| `-m` | `--metrics` | No | `dissimilarity` | TCRemP metric mode |
| `-d` | `--save-dists` | No | `True` | Save TCRemP distances |
| `-cl` | `--cluster` | No | `True` | Run clustering in embedding workflow |
| `-se` | `--sample-embedding` | No | none | Path to precomputed sample embeddings |
| `-be` | `--background-embedding` | No | none | Path to precomputed background embeddings |
| none | `--cluster-algo` | No | `vdbscan` | `vdbscan`, `dbscan`, `leiden`, `hierarchical_leiden`, or `leiden_dbscan` |
| none | `--n-bg-points` | No | all available | Limit background clonotypes to first N entries |
| `-npc` | `--cluster-pc-components` | No | `50` | Number of PCA components before clustering |
| `-ms` | `--cluster-min-samples` | No | `3` | Core-point threshold for clustering |
| `-kn` | `--k-neighbors` | No | `4` | Number of neighbors in the KNN graph |
| `-ekn` | `--eps-k-neighbors` | No | `4` | K-th neighbor used for eps estimation in `vdbscan` and `dbscan` |
| none | `--leiden-resolution` | No | `1.0` | Leiden resolution parameter |
| none | `--leiden-sub-resolution` | No | `1.0` | Subclustering resolution for `hierarchical_leiden` |
| none | `--eps-estimation-based-on` | No | `sample` | Estimate eps from `sample`, `background`, or `all` |
| none | `--vdbscan-sym-rule` | No | `asymmetric` | Symmetrization rule: `asymmetric`, `min`, or `max` |
Vlasova et al., RedCEA: repertoire embeddings denoising clustering enrichment analysis, 2025, in preparation.