- Running Strainify without a precomputed variant matrix
- Running Strainify with a precomputed variant matrix
- Example Output
- Clustering input reference genomes
- Unknown strains in metagenomic samples
Strainify includes example input data to help you get started quickly.
Build and activate the environment once (see the README), then run Strainify on the paired-end example data.
conda activate strainify
./strainify \
--genome_folder example/genomes \
--fastq_folder example/fastqs/paired \
--outdir example/resultspaired is the default read type, so --read_type is omitted here; for single-end data pass
--read_type single (as in the next section). The output is written to example/results.
If you are running Strainify again on the same set of genomes, you can reuse the precomputed
variant matrix by passing --use_precomputed_variants along with --precomputed_dir (the
directory that contains the previous run's filtered_variant_matrix.csv, reference.fna,
and sites.txt) and a fresh --outdir for the new results. Reusing the matrix from the
previous example:
./strainify \
--genome_folder example/genomes \
--fastq_folder example/fastqs/single \
--read_type single \
--use_precomputed_variants \
--precomputed_dir example/results \
--outdir example/new_outputThe output will be written to the example/new_output directory.
The abundance estimates are stored in a CSV file named abundance_estimates_combined.csv.
- Each row corresponds to a strain.
- Each column (after the first) corresponds to a sample.
- Values represent the estimated relative abundances.
Example CSV output:
strain name,10x_ratio_2
E24377A,23.7987
H10407,24.8376
Sakai,24.9406
UTI89,26.423Note: these numbers are percentages.
Other important output files:
sites.txtcontains a list of variant positions that passed the filter. Read counts supporting the allele and reference base at these positions are then obtained and used as input to the MLE model.filtered_variant_matrix.csvcontains the filtered variant matrix. Confounding variants (potential recombination sites) have been removed. For metagenomic samples that share the same set of strains (i.e. query genomes), this file can be reused to avoid rerunning the genome alignment and variant filtering steps. For more details, see instructions above for running Strainify with a precomputed variant matrix.significantly_enriched_windows.tsvcontains the start and end coordinates of windows that are flagged as potential recombination sites. The z-score and p-values of each window are also shown. Variants in these windows are removed from downstream analysis (i.e. excluded from the filtered variant matrix).abundance_estimates_bootstrap_CIs.csv(if--bootstrapis applied) contains the 95% confidence intervals for each estimated relative abundance value.
By default, Strainify defines each input reference genome as a strain, and it is able to operate with strains that are highly similar provided a reference genome is available for each one. However, strains can also be defined at a less granular level by clustering similar reference genomes and selecting a representative sequence from each cluster as input for Strainify.
A simple tool that can be used for clustering is TreeCluster. You can first run Parsnp (included in Strainify's dependencies, no need to install separately) with your reference genomes to obtain a tree (one of Parsnp's default outputs). For example:
parsnp -r reference_genome.fna -d path/to/genome/folder/*.fnaTip: if you would like Parsnp to randomly select a genome from the input as the reference for alignment, you can simply replace
reference_genome.fnawith!.
Look for parsnp.tree in Parsnp's output, and provide it to TreeCluster. Here is an example command:
TreeCluster.py -i parsnp.tree -m max_clade -t 0.001 -o clusters.tsvYou can then select a representative genome from each cluster and provide them to Strainify. Strainify will then estimate relative abundances at the cluster level.
When strains in a metagenomic sample to be analyzed are unknown (i.e. reference genomes are not available), Strainify can be paired with an upstream strain identification tool such as StrainGST. Just run the upstream tool and collect the genomes that it identified in your metagenomic samples, and then run Strainify with these genomes as references. For longitudinal/time-series studies, we recommend running the upstream tool on metagenomic samples from all timepoints, and provide all of the identified genomes to Strainify.
Two helper scripts in helpers/ automate the StrainGST hand-off.
You must install StrainGE yourself. Install it in a separate conda environment and activate that environment before running these scripts (the NCBI download route also needs
ncbi-genome-downloadin that environment). If you'd rather not activate it, the scripts accept the optional--strainge-env <name>(and--download-env <name>for the NCBI step) to run the tools in the named env viaconda run -ninstead.
1. Build a StrainGST database (one-time). Either download reference genomes from NCBI by genus/species, or point at a folder of genomes you already have:
# from NCBI:
helpers/build_straingst_db.sh \
--taxa "Escherichia,Shigella" \
--threads 16 --outdir straingst_db
# or from a custom genome folder:
helpers/build_straingst_db.sh \
--genome-folder my_reference_genomes/ \
--outdir straingst_dbThis produces straingst_db/pan-genome-db.hdf5 (the database) and straingst_db/db/ (the
reference genomes). Tune with --kmer-size, --cluster-jaccard, and --cluster-subset
(run with --help for all options).
2. Discover strains per sample and collect their genomes. Run StrainGST on your FASTQs; for each sample it lists the discovered strains and copies their genome FASTAs into a folder ready for Strainify:
helpers/run_straingst.sh \
--db straingst_db/pan-genome-db.hdf5 \
--ref-dir straingst_db \
--fastq-folder path/to/fastqs --read-type single \
--outdir straingst_resultsFor each <sample> this writes straingst_results/<sample>/<sample>.strains.txt (the discovered
strains), a combined straingst_results/discovered_strains.tsv summary, and
straingst_results/<sample>/genomes/ (the genome FASTAs).
StrainGST parameters are fully tunable: pass any straingst run flags through --run-args "..."
(for example -i to cap how many strains are reported, or --minscore to raise the minimum
score), any straingst kmerize flags through --kmerize-args "...", and set the k-mer size with
--kmer-size (it must match the database). Run the script with --help for the full list.
3. Run Strainify on the discovered genomes. Point --genome_folder at a sample's collected
genomes:
conda activate strainify
./strainify run \
--genome_folder straingst_results/<sample>/genomes \
--fastq_folder path/to/fastqs \
--read_type single \
--outdir results/<sample>For longitudinal/time-series studies, pool the genomes discovered across all timepoints into a single folder and pass that to Strainify, so every timepoint is profiled against the same strain set.