bakeoff - Long-read taxonomic profiling of the gut microbiome

Companion repository for Huang et al., "Critical Evaluation of Long Read Taxonomic Profiling of the Gut Microbiome" (manuscript in preparation).

Introduction

This study benchmarks six long-read-capable taxonomic profilers - Kraken2, Centrifuge, Centrifuger, Ganon2, Sourmash, and Sylph - across three dataset categories (PBSIM3 + MIMIC simulated reads, ZymoBIOMICS D6331 mock community, and the DYN clinical cohort) on both PacBio HiFi and Oxford Nanopore reads. Each method is evaluated under (a) its native default database and (b) a shared unified RefSeq-228 build (~69k genomes spanning archaea, bacteria, fungi, and viruses), holding the reference fixed to disentangle database effects from algorithm effects.

The paper argues three findings:

Database recency matters across the board. Database choice materially affects accuracy regardless of the tool, especially for lower-abundance taxa. Harmonizing to a unified RefSeq reference visibly reduced cross-tool disagreement on both mock and clinical samples - keeping the reference current is best practice regardless of the algorithm.
Sketch-based profilers led the controlled benchmarks. Across simulated and mock data, Sourmash, Sylph, and Ganon2 posted the strongest detection performance, generally driven by higher precision at comparably high recall.
Controlled-benchmark rankings transfer only indirectly to real data. On the clinical cohort, cross-tool agreement among the more accurate methods held for moderate-to-high-abundance taxa but weakened on low-abundance ones - so a multi-method composite approach may still be valuable in practice.

Prior benchmarks emphasized short reads, used method-specific databases (confounding algorithm effects with database effects), and relied on narrow synthetic communities. This study fixes the reference and tests across mock, simulated, and clinical data to isolate where each effect arises.

Install

1. Clone the repository

git clone https://github.com/treangenlab/bakeoff.git
cd bakeoff

2. Create the conda environment

The full pipeline environment - the six paper profilers (version-pinned to manuscript Methods), profiler-adjacent helpers (sylph-tax, taxonkit, multitax, mosdepth, lemur), and the analysis stack (Python, Jupyter, scikit-learn, seaborn, ete3, biopython, etc.) - is declared in config/bakeoff_env.yaml:

conda env create -n bakeoff -f config/bakeoff_env.yaml
conda activate bakeoff

Mamba works identically (mamba env create ...) and is significantly faster on this env.

Pinned profiler versions:

Tool	Version	Notes
Kraken2	2.1.5
Centrifuge	1.0.4.2
Centrifuger	1.0.9
Ganon2	2.1.1
Sourmash	4.9.0
Sylph	0.8.1
sylph-tax	1.7.0	Other versions silently produce `NO_TAXONOMY` rows against the manuscript's metadata.

3. Verify the install

for t in kraken2 centrifuge centrifuger ganon sourmash sylph; do
  command -v "$t" >/dev/null && echo "$t: $(command -v "$t")" || echo "$t: MISSING"
done

All six should resolve to paths inside the active conda env.
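
To go one step beyond path resolution, the installed versions can be compared against the pinned table above. The helper below is a sketch, not part of the repository: `check_pins` is a name introduced here, it reads `conda list`-style lines (`name version ...`) on stdin, and it assumes the conda package names are the lowercase tool names (for example `ganon` for Ganon2), which may not hold on every channel.

```shell
# check_pins: hypothetical helper, not shipped with the repository.
# Reads "name version ..." lines on stdin (the layout `conda list` prints)
# and reports whether each pinned package matches the manuscript version.
check_pins() {
  awk '
    BEGIN {
      # Versions copied from the pinned table; package names are assumed.
      pin["kraken2"]     = "2.1.5"
      pin["centrifuge"]  = "1.0.4.2"
      pin["centrifuger"] = "1.0.9"
      pin["ganon"]       = "2.1.1"
      pin["sourmash"]    = "4.9.0"
      pin["sylph"]       = "0.8.1"
      pin["sylph-tax"]   = "1.7.0"
    }
    /^#/ { next }                  # skip the header lines conda list prints
    ($1 in pin) { seen[$1] = $2 }  # remember the installed version
    END {
      bad = 0
      for (p in pin) {
        if (!(p in seen)) {
          printf "%s: MISSING (want %s)\n", p, pin[p]; bad = 1
        } else if (seen[p] != pin[p]) {
          printf "%s: %s (want %s)\n", p, seen[p], pin[p]; bad = 1
        } else {
          printf "%s: %s OK\n", p, seen[p]
        }
      }
      exit bad                     # non-zero when any pin is missing or differs
    }'
}
```

Typical use would be `conda list -n bakeoff | check_pins`. The report order is not fixed, because awk does not guarantee array iteration order.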

4. System tools for the data fetch (typically already present)

The data fetch needs curl, jq, zstd ≥ 1.4, tar, and sha256sum. Quick check:

for c in curl jq zstd tar sha256sum; do
  command -v "$c" >/dev/null && echo "$c: OK" || echo "$c: MISSING"
done
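
The loop above confirms only that the tools exist; it does not test the zstd ≥ 1.4 requirement. A minimal sketch of that extra check follows. `version_ge` and `zstd_version` are names made up here, the comparison relies on `sort -V` (GNU coreutils, an assumption about the host), and only the first dotted number in the `zstd --version` banner is parsed because the banner wording differs between releases.

```shell
# version_ge A B: succeeds when dotted version A is >= B.
# Relies on `sort -V` (GNU coreutils; an assumption about the host system).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the first dotted version number found in the `zstd --version` banner.
zstd_version() {
  zstd --version 2>/dev/null | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

if command -v zstd >/dev/null; then
  v=$(zstd_version)
  if version_ge "$v" 1.4; then
    echo "zstd $v: OK"
  else
    echo "zstd $v: too old (need >= 1.4)"
  fi
else
  echo "zstd: MISSING"
fi
```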

Quick reproduction

Seven of the ten manuscript figures (Figs 2-8) plus Table S1 and Table S2 reproduce from the analysis tables shipped with this repository - no reference databases, no read files, no profiler runs needed. After installing the env, just execute a notebook:

# cd into the bakeoff project root first
conda activate bakeoff
jupyter nbconvert --to notebook --execute scripts/analysis/analysis_detection.ipynb

The figure and the per-rank metric tables land under results/<ts>/detection/threshold_<x>/<dataset>/{tables,figures}/. To render a different dataset or figure, change DATASET (and the threshold, where relevant) in the config cell at the top of the notebook:

Fig / Table	Notebook	What to set
2, 3, S2	`scripts/analysis/analysis_detection.ipynb`	`DATASET` (mock or simulated key)
4, Table S2	`scripts/analysis/auxiliary/sensitivity_floor.ipynb`	- (pools all four D6331 mocks; emits Fig 4 panels + the `rare_tier_detection_matrix.csv` that backs Table S2)
5, 6, S3, S4, S5	`scripts/analysis/analysis_abundance.ipynb`	`DATASET` (mock key)
7, S6, S7	`scripts/analysis/dyn_heatmap.ipynb`	`DATASET`, `MIN_ABUNDANCE`
8, S8	`scripts/analysis/dyn_alpha_div.ipynb`	cohorts, `MIN_ABUNDANCE`
Table S1	`scripts/analysis/auxiliary/database_comparison.ipynb`	-
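
To execute every notebook in the table with whatever its config cell currently selects, a loop along these lines may help. It is a sketch rather than a repository script: `run_notebooks` and `DRY_RUN` are names introduced here, the notebook paths are copied from the table, and bash or sh word splitting is assumed.

```shell
# Notebook paths copied from the table above.
NOTEBOOKS="
scripts/analysis/analysis_detection.ipynb
scripts/analysis/auxiliary/sensitivity_floor.ipynb
scripts/analysis/analysis_abundance.ipynb
scripts/analysis/dyn_heatmap.ipynb
scripts/analysis/dyn_alpha_div.ipynb
scripts/analysis/auxiliary/database_comparison.ipynb
"

# run_notebooks: hypothetical helper. With DRY_RUN=1 it only prints the
# commands; otherwise it executes each notebook and stops at the first failure.
run_notebooks() {
  for nb in $NOTEBOOKS; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "jupyter nbconvert --to notebook --execute $nb"
    else
      jupyter nbconvert --to notebook --execute "$nb" || return 1
    fi
  done
}

DRY_RUN=1 run_notebooks   # preview the six commands without running them
```

Run it from the project root with the bakeoff env active. Each notebook still renders only the dataset and threshold set in its own config cell, so this does not replace editing DATASET for the other datasets.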

Figs 7, 8, and their supplementary panels additionally need the DYN cohort cache under results/metadata/<ts>/dyn-prep/tables/cohorts/ - shipped with the repo. Figs 1, 9, 10, and Fig S1 (read-length, resource benchmarking, recommended-workflow diagram, error rates) require the full pipeline or are hand-drawn; they are not metadata-reproducible.

To sweep a threshold (e.g. the published detection sensitivity panel 0.0 / 0.0001 / 0.001 / 0.01 / 0.1) across every dataset in one command, without editing the notebook, use sweep_notebook.sh:

# Detection - 6 datasets × 5 thresholds = 30 runs
scripts/analysis/sweep_notebook.sh \
    --notebook scripts/analysis/analysis_detection.ipynb \
    --var THRESHOLD_PERCENT \
    --datasets "pacbio_low_input pacbio_standard_input ont_zymo_kit ont_qiagen_kit sim_pacbio sim_ont"

# Abundance - 4 datasets × 5 thresholds = 20 runs (abundance is mock-only)
scripts/analysis/sweep_notebook.sh \
    --notebook scripts/analysis/analysis_abundance.ipynb \
    --var MIN_ABUNDANCE \
    --datasets "pacbio_low_input pacbio_standard_input ont_zymo_kit ont_qiagen_kit"

Both commands use the default --values "0.0 0.0001 0.001 0.01 0.1"; pass --values "..." to override it.
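
The run counts quoted in the comments above are simply datasets × values. The sketch below enumerates that grid so a sweep can be sized before it is launched. It does not call or reproduce sweep_notebook.sh, and `sweep_grid` is a name made up for illustration.

```shell
# sweep_grid DATASETS VALUES: print one "dataset value" pair per line.
# Both arguments are space-separated lists, as passed to --datasets / --values.
sweep_grid() {
  for d in $1; do
    for v in $2; do
      echo "$d $v"
    done
  done
}

DATASETS="pacbio_low_input pacbio_standard_input ont_zymo_kit ont_qiagen_kit sim_pacbio sim_ont"
VALUES="0.0 0.0001 0.001 0.01 0.1"

# 6 datasets x 5 values -> 30 lines, matching the detection sweep above.
sweep_grid "$DATASETS" "$VALUES" | wc -l
```

Dropping the two simulated keys from `DATASETS` gives the 20-run abundance grid.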

For full reproduction details, see the wiki page "Reproducing the manuscript", which also covers the full-pipeline path (raw reads → profilers → notebooks) for the figures above.

Full reproduction (from raw reads)

Only needed if you want to regenerate the analysis tables themselves (or reproduce Fig 1, Fig 9, Fig S1). Reference databases and the simulated dataset live on Zenodo (grouped in the Treangen Lab Bakeoff community); mock-community reads are obtained from their original SRA accessions. See the wiki for the full recipe:

Data fetch guide - Zenodo bundles, manual reassembly, post-extract steps
Reproducing the manuscript - per-figure / per-table walkthroughs (quick and full paths)
Database build - rebuilding the unified RefSeq-228 DB from scratch
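
Because sha256sum is among the tools listed for the data fetch, downloaded bundles are presumably checked against published checksums. The following is a generic sketch of such a check, not the wiki's procedure: `verify_bundle` is a hypothetical helper, and the expected digest would come from wherever the bundle's checksum is published.

```shell
# verify_bundle FILE EXPECTED_SHA256: succeed only if the file's digest matches.
# Hypothetical helper; follow the wiki's data fetch guide for the real steps.
verify_bundle() {
  # sha256sum prints "<digest>  <file>"; keep only the digest.
  actual=$(sha256sum "$1" 2>/dev/null | awk '{print $1}')
  if [ -n "$actual" ] && [ "$actual" = "$2" ]; then
    echo "$1: checksum OK"
  else
    echo "$1: checksum MISMATCH or file unreadable" >&2
    return 1
  fi
}
```

It would be called as `verify_bundle <downloaded file> <published digest>` before extracting anything, so a truncated download fails early instead of surfacing later as a corrupt database.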
