RedCEA: Repertoire Embeddings Denoising Clustering Enrichment Analysis

RedCEA is a pipeline for comparing immune repertoires using prototype-based TCR embeddings. It builds on the original TCRemP embedding method and adds denoising, clustering, and enrichment analysis for case/background repertoire comparisons.

This repository contains command-line tools for:

computing prototype-based embeddings for a repertoire with tcremp-run
clustering embeddings with tcremp-cluster
running the end-to-end comparison pipeline with redcea

Installation

Prerequisites

Prepare a clean Linux server with:

git
conda (for example, Miniconda or Mambaforge)
internet access for Python package installation
access to GitHub, because the mir dependency is installed from a git URL

Recommended Python version: 3.11.

Create the environment

git clone https://gitlab.aldan3.itm-rsmu.ru/isagroup/redcea.git
cd redcea

conda create -n redcea python=3.11 -y
conda activate redcea

python -m pip install --upgrade pip setuptools wheel
python -m pip install -e ".[test]"

This installs:

the redcea package in editable mode
the RedCEA CLI entry points redcea and tcrempnet
the tcremp dependency, which provides tcremp-run and tcremp-cluster
test dependencies including pytest

Notes:

mir is installed from https://github.com/antigenomics/mirpy.git
default clustering for redcea is vdbscan
optional Leiden-based clustering requires an extra dependency

Optional: install Leiden support

If you plan to run redcea with --cluster-algo set to leiden, hierarchical_leiden, or leiden_dbscan, install the optional Leiden dependency:

python -m pip install -e ".[leiden]"

If this optional install fails, you can still run the default vdbscan pipeline.
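
A wrapper script can check up front whether the chosen clustering mode depends on the optional Leiden extra. The sketch below only classifies the mode names listed in this README. The function name needs_leiden_extra is hypothetical and is not part of the redcea CLI.

```shell
# Hypothetical helper: succeed (exit status 0) when the given --cluster-algo
# value is one of the Leiden-based modes that need the optional extra.
needs_leiden_extra() {
    case "$1" in
        leiden|hierarchical_leiden|leiden_dbscan) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: print a reminder before starting a long run.
algo=leiden_dbscan
if needs_leiden_extra "$algo"; then
    echo "mode '$algo' needs the optional Leiden dependency"
fi
```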

Clustering modes

redcea supports several clustering backends:

vdbscan: default RedCEA mode with per-group eps estimation on the joint sample/background graph
dbscan: legacy TCRempNet mode on reduced embeddings, with pre-filtering of points with d1 > eps before running plain DBSCAN
leiden: graph clustering on the joint KNN graph
hierarchical_leiden: two-stage Leiden clustering
leiden_dbscan: Leiden followed by per-cluster DBSCAN refinement

Use dbscan if you need behavior closer to historical TCRempNet runs and want the old epsilon-based noise pre-filter back.

Verify the installation

Run the following commands in the activated environment:

python -c "import redcea, tcremp, mir; print('imports: OK')"
redcea --help
tcremp-run --help
pytest -q

Expected result:

imports succeed without ModuleNotFoundError
CLI help is printed for the requested commands
tests pass
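
If one of the commands above is not found, it helps to see all missing entry points at once. The following is a minimal sketch that only tests whether command names resolve on PATH. The helper name check_commands is an assumption made for illustration.

```shell
# Report every required command that is missing from PATH.
# Returns 0 only if all commands were found.
check_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage in the activated environment:
#   check_commands redcea tcremp-run tcremp-cluster pytest
```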

Important limitation:

the test suite validates parser logic, helper functions, and a mocked pipeline smoke test
it does not replace a real run on a small dataset in your target environment

Running RedCEA

Option 1: two-step execution

If you plan to compare multiple case samples against the same background repertoire, compute the background embeddings once with tcremp-run and reuse them in downstream redcea runs.

Step 1: compute embeddings

tcremp-run \
  --input /projects/immunestatus/airr_format/sample.tsv \
  --output ./results \
  --chain TRB \
  -np 48

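This writes the embedding outputs to ./results. Repeat the command with the background repertoire as --input to produce the background embeddings used in Step 2.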

For large samples, embedding can take hours and requires substantial CPU and memory.

Step 2: run `redcea` on saved embeddings

redcea \
  -is /projects/immunestatus/airr_format/sample.tsv \
  -ib /projects/immunestatus/airr_format/background.tsv \
  -c TRB \
  -o ./results \
  -np 4 \
  -se ./results/sample_tcremp.parquet \
  -be ./results/background_tcremp.parquet

Use this mode when the embedding files already exist and you want to skip recomputation.
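
When several case samples share one background, the Step 2 command can be generated once per case. The sketch below only prints the commands so they can be reviewed before anything runs. It assumes that -be may be given without -se, that the background embedding path is whatever Step 1 produced on your system, and that file paths contain no spaces. The function name print_redcea_commands, the fixed TRB chain, and the per-case output layout are illustrative choices.

```shell
# Print one redcea command per case file, all reusing the same precomputed
# background embeddings. Nothing is executed here.
#   $1  background repertoire table
#   $2  precomputed background embeddings (path is an assumption)
#   $3  parent output directory
#   $4+ case repertoire tables
print_redcea_commands() {
    background=$1
    bg_embeddings=$2
    outdir=$3
    shift 3
    for case_file in "$@"; do
        name=$(basename "$case_file")
        name=${name%.*}   # drop the file extension
        printf 'redcea -is %s -ib %s -be %s -c TRB -o %s/%s -np 8\n' \
            "$case_file" "$background" "$bg_embeddings" "$outdir" "$name"
    done
}

# Usage: review the printed commands, then pipe them to sh to execute them.
#   print_redcea_commands background.tsv ./results/background_tcremp.parquet ./results case1.tsv case2.tsv
```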

Option 2: end-to-end pipeline

redcea \
  -is sample.tsv \
  -ib background.tsv \
  -c TRB \
  -o ./results \
  -np 8

In this mode, embeddings for both sample and background are computed automatically if they are not already available.

CLI Tools

CLI Tool	Description
`tcremp-run`	Computes TCRemP embeddings and optional clustering
`redcea`	Runs embedding, clustering, and enrichment
`tcremp-cluster`	Clusters existing embeddings

Example: Yellow Fever Dataset

redcea \
  --sample /projects/immunestatus/pogorelyy/airr_format/yfv_day_15.txt \
  --background /projects/immunestatus/pogorelyy/airr_format/yfv_day_0.txt \
  --output /projects/immunestatus/pogorelyy/redcea/yfv_res \
  --chain TRB \
  --prefix yfv_result \
  -np 16

Output Files

Depending on the mode, the pipeline may create:

File Name	Description
`*_sample_embeddings.parquet`	Sample embeddings produced or reused by `redcea`
`*_background_embeddings.parquet`	Background embeddings produced or reused by `redcea`
`*_tcremp_clusters.tsv`	Cluster assignments for both sample and background clonotypes
`*_summary_tcrempnet.tsv`	Per-cluster summary with counts, p-values, FDR, and log fold change
`*_enriched_clonotypes_tcremp.tsv`	Clonotypes from enriched clusters
`*_enriched_embeddings_tcremp.parquet`	Embeddings of enriched clonotypes with cluster metadata
`*.log`	Run log for debugging and runtime tracking

What to check after a real run

Treat the run as successful only if all of the following are true:

the output directory exists
*_sample_embeddings.parquet and *_background_embeddings.parquet exist or were intentionally supplied as inputs
*_tcremp_clusters.tsv exists
*_summary_tcrempnet.tsv exists and contains cluster_id, cluster_size, sample, background, enrichment_fdr_zbinom, and log_fold_change
the log file ends with the message "TCRempNet pipeline completed."
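
The checklist above can be scripted. This is a sketch under stated assumptions: the log file is written to the output directory, the summary table is tab-separated with a header row, and the completion message appears on the last line of the log. It does not check the embedding files, because those may have been supplied as inputs. All function names are hypothetical.

```shell
# Succeed if at least one file in directory $1 matches glob pattern $2.
has_match() {
    for candidate in "$1"/$2; do
        [ -e "$candidate" ] && return 0
    done
    return 1
}

# Succeed if the tab-separated header of file $1 contains every column
# named in the remaining arguments.
has_columns() {
    header_file=$1
    shift
    for col in "$@"; do
        head -n 1 "$header_file" 2>/dev/null | tr -d '\r' | tr '\t' '\n' |
            grep -qxF -- "$col" || return 1
    done
    return 0
}

# Apply the checklist to one output directory.
check_redcea_run() {
    outdir=$1
    [ -d "$outdir" ] || { echo "no output directory" >&2; return 1; }
    has_match "$outdir" '*_tcremp_clusters.tsv' ||
        { echo "no cluster table" >&2; return 1; }
    for summary in "$outdir"/*_summary_tcrempnet.tsv; do
        has_columns "$summary" cluster_id cluster_size sample background \
            enrichment_fdr_zbinom log_fold_change ||
            { echo "bad or missing summary: $summary" >&2; return 1; }
    done
    for log in "$outdir"/*.log; do
        tail -n 1 "$log" 2>/dev/null |
            grep -qF 'TCRempNet pipeline completed.' && return 0
    done
    echo "no log ends with the completion message" >&2
    return 1
}

# Usage:
#   check_redcea_run ./results && echo "run looks complete"
```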

Input Expectations

The pipeline expects repertoire tables that can be parsed by the underlying tcremp AIRR-loading utilities.

Before running on a clean server, verify on one small file that:

the file path is correct and readable by the current user
the repertoire contains the requested chain: TRA, TRB, or TRA_TRB
required CDR3 and V/J fields expected by tcremp are present
the file is not empty after filtering by chain and CDR3 length

If a run fails at startup, first check file format compatibility and chain selection.
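
A quick pre-flight count can show whether a file would be empty after chain and CDR3-length filtering. The sketch below assumes a tab-separated table with the AIRR Rearrangement column names v_call and junction_aa. The columns that tcremp actually reads may differ, so treat the names as assumptions and adjust them to your files. It handles a single chain prefix such as TRA or TRB, not the paired TRA_TRB mode, and the default length bounds mirror the -llen and -hlen defaults listed under Arguments.

```shell
# Count rows whose V gene starts with the requested chain prefix and whose
# CDR3 amino-acid length lies within [min_len, max_len].
#   $1 file, $2 chain prefix (TRA or TRB), $3 min_len (default 5), $4 max_len (default 30)
# Exits with status 2 if the assumed columns are absent from the header.
count_usable_rows() {
    awk -F '\t' -v chain="$2" -v lo="${3:-5}" -v hi="${4:-30}" '
        { sub(/\r$/, "") }   # tolerate Windows line endings
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("v_call" in col) || !("junction_aa" in col)) { bad = 1; exit 2 }
            next
        }
        index($(col["v_call"]), chain) == 1 {
            n = length($(col["junction_aa"]))
            if (n >= lo && n <= hi) count++
        }
        END { if (bad) exit 2; print count + 0 }
    ' "$1"
}

# Usage:
#   count_usable_rows sample.tsv TRB
```

A result of 0 suggests a chain mismatch or a column-naming difference rather than a problem in the pipeline itself.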

SLURM Job Example

Activate the redcea environment before submitting the job.

Full pipeline

#!/bin/sh
#SBATCH --job-name=redcea
#SBATCH --cpus-per-task=48
#SBATCH --mem=128gb
#SBATCH --time=08:00:00
#SBATCH --output=redcea_run.%j.log

redcea \
  -is case.tsv \
  -ib control.tsv \
  -c TRB \
  -o ./results \
  -np 48
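
The script above states the CPU count twice, once in --cpus-per-task and once in -np. Slurm exports SLURM_CPUS_PER_TASK inside a job when --cpus-per-task is set, so the worker count can be derived from it. The helper name nproc_for_job is hypothetical, and the fallback of 1 matches the -np default listed under Arguments.

```shell
# Use the CPU count granted by Slurm, or fall back to 1 outside a job.
nproc_for_job() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Inside the batch script, the call could then read:
#   redcea -is case.tsv -ib control.tsv -c TRB -o ./results -np "$(nproc_for_job)"
```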

Embedding only

tcremp-run \
  --input case.tsv \
  --output ./results \
  --chain TRB \
  -np 32

Arguments

Short	Long	Required	Default	Description
`-is`	`--sample`	Yes	none	Path to the sample repertoire table
`-ib`	`--background`	Yes	none	Path to the background repertoire table
`-o`	`--output`	Yes	none	Output directory
`-e`	`--prefix`	No	input filename	Output prefix
`-x`	`--index-col`	No	none	Optional input ID column to preserve in outputs
`-c`	`--chain`	Yes	none	`TRA`, `TRB`, or `TRA_TRB`
`-p`	`--prototypes-path`	No	package defaults	Path to a user-supplied prototypes file
`-n`	`--n-prototypes`	No	all available	Number of prototypes used for embedding
none	`--sample-random-prototypes`	No	`False`	Sample prototypes randomly
`-nc`	`--n-clonotypes`	No	all available	Number of clonotypes to process
none	`--sample-random-clonotypes`	No	`False`	Sample clonotypes randomly
`-s`	`--species`	No	`HomoSapiens`	Species for V/J gene alignment
`-u`	`--unique-clonotypes`	No	`False`	Use only unique clonotypes
`-r`	`--random-seed`	No	`42`	Random seed
`-np`	`--nproc`	No	`1`	Number of worker processes
`-llen`	`--lower-len-cdr3`	No	`5`	Minimum CDR3 length
`-hlen`	`--higher-len-cdr3`	No	`30`	Maximum CDR3 length
`-m`	`--metrics`	No	`dissimilarity`	TCRemP metric mode
`-d`	`--save-dists`	No	`True`	Save TCRemP distances
`-cl`	`--cluster`	No	`True`	Run clustering in embedding workflow
`-se`	`--sample-embedding`	No	none	Path to precomputed sample embeddings
`-be`	`--background-embedding`	No	none	Path to precomputed background embeddings
none	`--cluster-algo`	No	`vdbscan`	`vdbscan`, `dbscan`, `leiden`, `hierarchical_leiden`, or `leiden_dbscan`
none	`--n-bg-points`	No	all available	Limit background clonotypes to the first N entries
`-npc`	`--cluster-pc-components`	No	`50`	Number of PCA components before clustering
`-ms`	`--cluster-min-samples`	No	`3`	Core-point threshold for clustering
`-kn`	`--k-neighbors`	No	`4`	Number of neighbors in the KNN graph
`-ekn`	`--eps-k-neighbors`	No	`4`	K-th neighbor used for eps estimation in `vdbscan` and `dbscan`
none	`--leiden-resolution`	No	`1.0`	Leiden resolution parameter
none	`--leiden-sub-resolution`	No	`1.0`	Subclustering resolution for `hierarchical_leiden`
none	`--eps-estimation-based-on`	No	`sample`	Estimate eps from `sample`, `background`, or `all`
none	`--vdbscan-sym-rule`	No	`asymmetric`	Symmetrization rule: `asymmetric`, `min`, or `max`

Reference

Vlasova et al., RedCEA: repertoire embeddings denoising clustering enrichment analysis, 2025, in preparation.
