Companion repository for Huang et al., "Critical Evaluation of Long Read Taxonomic Profiling of the Gut Microbiome" (manuscript in preparation).
This study benchmarks six long-read-capable taxonomic profilers - Kraken2, Centrifuge, Centrifuger, Ganon2, Sourmash, and Sylph - across three dataset categories (PBSIM3 + MIMIC simulated reads, ZymoBIOMICS D6331 mock community, and the DYN clinical cohort) on both PacBio HiFi and Oxford Nanopore reads. Each method is evaluated under (a) its native default database and (b) a shared unified RefSeq-228 build (~69k genomes spanning archaea, bacteria, fungi, and viruses), holding the reference fixed to disentangle database effects from algorithm effects.
The paper argues three main findings:
- Database recency matters across the board. Database choice materially affects accuracy regardless of the tool, especially for lower-abundance taxa. Harmonizing to a unified RefSeq reference visibly reduced cross-tool disagreement on both mock and clinical samples - keeping the reference current is best practice regardless of the algorithm.
- Sketch-based profilers led the controlled benchmarks. Across simulated and mock data, Sourmash, Sylph, and Ganon2 posted the strongest detection performance, generally driven by higher precision at comparably high recall.
- Controlled-benchmark rankings transfer only indirectly to real data. On the clinical cohort, cross-tool agreement among the more accurate methods held for moderate-to-high-abundance taxa but weakened on low-abundance ones - so a multi-method composite approach may still be valuable in practice.
Prior profiler benchmarks emphasized short reads, used method-specific databases (confounding algorithm effects with database effects), and relied on narrow synthetic communities. This study fixes the reference and tests across mock, simulated, and clinical data to isolate where each effect lives.
git clone https://github.com/treangenlab/bakeoff.git
cd bakeoff

The full pipeline environment - the six paper profilers (version-pinned to manuscript Methods), profiler-adjacent helpers (sylph-tax, taxonkit, multitax, mosdepth, lemur), and the analysis stack (Python, Jupyter, scikit-learn, seaborn, ete3, biopython, etc.) - is declared in config/bakeoff_env.yaml:
conda env create -n bakeoff -f config/bakeoff_env.yaml
conda activate bakeoff

Mamba works identically (mamba env create ...) and is significantly faster on this env.
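A minimal sketch of picking the faster solver automatically when it is installed. The helper name `pick_solver` is hypothetical and not part of this repository; it only wraps the two commands already described above.

```shell
# pick_solver: print "mamba" if it is on PATH, otherwise fall back to "conda".
# (Hypothetical helper for illustration; not shipped with the repo.)
pick_solver() {
  if command -v mamba >/dev/null 2>&1; then
    echo mamba
  else
    echo conda
  fi
}
```

Usage, from the project root: `"$(pick_solver)" env create -n bakeoff -f config/bakeoff_env.yaml`.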
Pinned profiler versions:
| Tool | Version | Notes |
|---|---|---|
| Kraken2 | 2.1.5 | |
| Centrifuge | 1.0.4.2 | |
| Centrifuger | 1.0.9 | |
| Ganon2 | 2.1.1 | |
| Sourmash | 4.9.0 | |
| Sylph | 0.8.1 | |
| sylph-tax | 1.7.0 | Other versions silently produce NO_TAXONOMY rows against the manuscript's metadata. |
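The pins above can be compared against what is actually installed. The sketch below parses `conda list`-style output (package name in the first column, version in the second) and reports any deviation. It assumes the conda package names match the tool names as written in `PINS`, which may not hold for every channel; `check_pins` is a hypothetical helper, not part of this repository.

```shell
# Pinned versions copied from the table above, as name=version pairs.
# Assumption: these names match the conda package names in the env.
PINS="kraken2=2.1.5 centrifuge=1.0.4.2 centrifuger=1.0.9 ganon=2.1.1 sourmash=4.9.0 sylph=0.8.1 sylph-tax=1.7.0"

# check_pins: read `conda list`-style lines on stdin and print
# OK / MISMATCH / MISSING for each pinned package.
# Exits non-zero if any package deviates from its pin.
check_pins() {
  awk -v pins="$PINS" '
    BEGIN {
      n = split(pins, pairs, " ")
      for (i = 1; i <= n; i++) {
        split(pairs[i], kv, "=")
        order[i] = kv[1]
        want[kv[1]] = kv[2]
      }
    }
    /^#/ { next }                      # skip the conda list header lines
    ($1 in want) { have[$1] = $2 }     # remember the installed version
    END {
      bad = 0
      for (i = 1; i <= n; i++) {
        p = order[i]
        if (!(p in have)) {
          printf "%s: MISSING (want %s)\n", p, want[p]; bad = 1
        } else if (have[p] != want[p]) {
          printf "%s: MISMATCH (have %s, want %s)\n", p, have[p], want[p]; bad = 1
        } else {
          printf "%s: OK (%s)\n", p, have[p]
        }
      }
      exit bad
    }'
}
```

Usage: `conda list -n bakeoff | check_pins`. Checking the sylph-tax line is the most useful part in practice, since a wrong version fails silently rather than with an error.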
for t in kraken2 centrifuge centrifuger ganon sourmash sylph; do
  command -v "$t" >/dev/null && echo "$t: $(command -v "$t")" || echo "$t: MISSING"
done

All six should resolve to paths inside the active conda env.
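To check the "inside the active conda env" condition mechanically rather than by eye, the resolved path can be compared against `$CONDA_PREFIX`, which `conda activate` sets. This is a minimal sketch; `in_env` and `check_env_paths` are hypothetical helper names, not part of this repository.

```shell
# in_env PATH PREFIX: succeed if PATH lies under directory PREFIX.
in_env() {
  case "$1" in
    "$2"/*) return 0 ;;
    *)      return 1 ;;
  esac
}

# check_env_paths: report whether each profiler resolves inside the active env.
check_env_paths() {
  local prefix=${CONDA_PREFIX:-} t p
  if [ -z "$prefix" ]; then
    echo "CONDA_PREFIX is not set; run 'conda activate bakeoff' first" >&2
    return 1
  fi
  for t in kraken2 centrifuge centrifuger ganon sourmash sylph; do
    p=$(command -v "$t" 2>/dev/null) || { echo "$t: MISSING"; continue; }
    if in_env "$p" "$prefix"; then
      echo "$t: OK ($p)"
    else
      echo "$t: OUTSIDE ENV ($p)"
    fi
  done
}
```

Usage: run `check_env_paths` after activating the env. An `OUTSIDE ENV` line usually means a system-wide install is shadowing the pinned version on `PATH`.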
Required command-line tools: curl, jq, zstd ≥ 1.4, tar, sha256sum. Quick check:
for c in curl jq zstd tar sha256sum; do
  command -v "$c" >/dev/null && echo "$c: OK" || echo "$c: MISSING"
done

Seven of the ten manuscript figures (Figs 2-8) plus Table S1 and Table S2 reproduce from the analysis tables shipped with this repository - no reference databases, no read files, no profiler runs needed. After installing the env, just execute a notebook:
# cd into the bakeoff project root first
conda activate bakeoff
jupyter nbconvert --to notebook --execute scripts/analysis/analysis_detection.ipynb

The figures (and per-rank metric tables) land under results/<ts>/detection/threshold_<x>/<dataset>/{tables,figures}/. To render a different dataset or figure, change the DATASET (and threshold where relevant) in the notebook's config cell at the top:
| Fig / Table | Notebook | What to set |
|---|---|---|
| 2, 3, S2 | scripts/analysis/analysis_detection.ipynb | DATASET (mock or simulated key) |
| 4, Table S2 | scripts/analysis/auxiliary/sensitivity_floor.ipynb | - (pools all four D6331 mocks; emits Fig 4 panels + the rare_tier_detection_matrix.csv that backs Table S2) |
| 5, 6, S3, S4, S5 | scripts/analysis/analysis_abundance.ipynb | DATASET (mock key) |
| 7, S6, S7 | scripts/analysis/dyn_heatmap.ipynb | DATASET, MIN_ABUNDANCE |
| 8, S8 | scripts/analysis/dyn_alpha_div.ipynb | cohorts, MIN_ABUNDANCE |
| Table S1 | scripts/analysis/auxiliary/database_comparison.ipynb | - |
Figs 7, 8, and their supplementary panels additionally need the DYN cohort cache under results/metadata/<ts>/dyn-prep/tables/cohorts/ - shipped with the repo. Figs 1, 9, 10, and Fig S1 (read-length, resource benchmarking, recommended-workflow diagram, error rates) require the full pipeline or are hand-drawn; they are not metadata-reproducible.
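Because each run writes under a fresh results/<ts>/ directory, it can help to locate the newest detection output for a given dataset. The sketch below assumes that `<ts>` directory names sort chronologically as plain text (for example a zero-padded timestamp); the actual format is not specified here. `latest_run_dirs` is a hypothetical helper, not part of this repository.

```shell
# latest_run_dirs DATASET [ROOT]
# Print every threshold_*/DATASET directory under the newest
# ROOT/<ts>/detection tree (ROOT defaults to "results").
# Assumption: <ts> names sort chronologically as text.
latest_run_dirs() {
  local dataset=$1 root=${2:-results} latest d
  latest=$(ls -d "$root"/*/detection 2>/dev/null | sort | tail -n 1)
  if [ -z "$latest" ]; then
    echo "no detection results found under $root" >&2
    return 1
  fi
  for d in "$latest"/threshold_*/"$dataset"; do
    if [ -d "$d" ]; then
      echo "$d"
    fi
  done
  return 0
}
```

Usage, from the project root: `latest_run_dirs sim_ont` lists one directory per threshold, each expected to contain the `tables` and `figures` subdirectories described above.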
To sweep a threshold (e.g. the published detection sensitivity panel 0.0 / 0.0001 / 0.001 / 0.01 / 0.1) without editing the notebook - and across every dataset in one command:
# Detection - 6 datasets × 5 thresholds = 30 runs
scripts/analysis/sweep_notebook.sh \
--notebook scripts/analysis/analysis_detection.ipynb \
--var THRESHOLD_PERCENT \
--datasets "pacbio_low_input pacbio_standard_input ont_zymo_kit ont_qiagen_kit sim_pacbio sim_ont"
# Abundance - 4 datasets × 5 thresholds = 20 runs (abundance is mock-only)
scripts/analysis/sweep_notebook.sh \
--notebook scripts/analysis/analysis_abundance.ipynb \
--var MIN_ABUNDANCE \
  --datasets "pacbio_low_input pacbio_standard_input ont_zymo_kit ont_qiagen_kit"

Both use the default --values "0.0 0.0001 0.001 0.01 0.1" (override with --values "...").
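The run counts quoted in the comments above are the product of datasets and threshold values. The sketch below enumerates that grid so a sweep can be previewed before launching it. It is an illustration only and makes no claim about how sweep_notebook.sh is implemented; `list_runs` is a hypothetical helper.

```shell
# list_runs VAR "DATASETS" ["VALUES"]
# Print one line per (dataset, value) pair as "dataset<TAB>VAR=value".
# VALUES defaults to the published detection sensitivity panel.
list_runs() {
  local var=$1 datasets=$2
  local values=${3:-"0.0 0.0001 0.001 0.01 0.1"}
  local d v
  for d in $datasets; do
    for v in $values; do
      printf '%s\t%s=%s\n' "$d" "$var" "$v"
    done
  done
}
```

For example, `list_runs THRESHOLD_PERCENT "sim_pacbio sim_ont" | wc -l` previews how many notebook executions a two-dataset sweep would trigger with the default values.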
See the wiki page Reproducing the manuscript for full reproduction details, including the full-pipeline path (raw reads → profilers → notebooks) for the figures above.
Only needed if you want to regenerate the analysis tables themselves (or reproduce Fig 1, Fig 9, Fig S1). Reference databases and the simulated dataset live on Zenodo (grouped in the Treangen Lab Bakeoff community); mock-community reads are obtained from their original SRA accessions. See the wiki for the full recipe:
- Data fetch guide - Zenodo bundles, manual reassembly, post-extract steps
- Reproducing the manuscript - per-figure / per-table walkthroughs (quick and full paths)
- Database build - rebuilding the unified RefSeq-228 DB from scratch
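As a rough illustration of what a manual fetch involves, the sketch below verifies a downloaded bundle against a SHA-256 checksum and then unpacks it. It assumes the bundles are zstd-compressed tar archives, which is inferred from the prerequisite list rather than stated here; the file name, checksum, and destination in the usage line are placeholders, and both helper names are hypothetical. The wiki's data fetch guide remains the authoritative procedure.

```shell
# sha256_matches FILE EXPECTED_HEX
# Succeed only if FILE's SHA-256 digest equals EXPECTED_HEX.
sha256_matches() {
  local actual
  actual=$(sha256sum "$1" | awk '{print $1}')
  [ "$actual" = "$2" ]
}

# extract_bundle FILE EXPECTED_HEX DEST
# Verify the checksum first, then stream-decompress into DEST.
# Assumption: FILE is a .tar.zst archive.
extract_bundle() {
  if ! sha256_matches "$1" "$2"; then
    echo "checksum mismatch: $1" >&2
    return 1
  fi
  mkdir -p "$3"
  zstd -dc "$1" | tar -xf - -C "$3"
}
```

Usage (placeholder values): `extract_bundle bundle.tar.zst <expected-sha256> databases/`. Verifying before extracting avoids unpacking a truncated download, which matters for multi-gigabyte database bundles.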