ehsanestaji/FastaGuard

FastaGuard

FastaGuard is a fast, explainable FASTA QC tool for validating assembly FASTA files before expensive downstream analysis.

The assembly FASTA gate before expensive QC.

It is not intended to compete with QUAST, BUSCO, BlobToolKit, FastQC, or MultiQC. FastaGuard is the earlier preflight and triage layer: the first command that answers whether a FASTA file is valid, sane, interpretable, and ready for downstream tools.

Before QUAST. Before BUSCO. Before BlobToolKit. Before annotation.
Run FastaGuard first.

Install

Recommended bioinformatics install:

mamba install -c conda-forge -c bioconda fastaguard

Verify the installed CLI:

fastaguard --version
fastaguard --schema

GitHub release binaries are also available for Linux and macOS:

tar -xzf fastaguard-v0.2.0-x86_64-unknown-linux-gnu.tar.gz
./fastaguard-v0.2.0-x86_64-unknown-linux-gnu/fastaguard --help

The v0.2.0 GitHub release binaries and source archive are published. Bioconda serves v0.2.0 for Linux x86_64, Linux ARM64, macOS Intel, and macOS Apple Silicon.

Local development build:

cargo build --release --locked

Quickstart

Run the assembly preflight check:

fastaguard sample.fa \
  --profile assembly \
  --out fastaguard_report.html \
  --json fastaguard.json \
  --tsv fastaguard.tsv \
  --multiqc fastaguard_mqc.json

Pipeline gate example:

fastaguard sample.fa --profile assembly --gate pipeline

The pipeline gate is the v0.3 assembly preset for workflow stop/go decisions, so it requires a v0.3 development build rather than the published v0.2.0 packages. It fails on duplicate IDs, invalid characters, invalid FASTA structure, and high-N content. GC and length outliers remain advisory by default because they are routing signals, not proof of contamination or misassembly. To make an advisory finding block a pipeline, add it explicitly with --fail-on.
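To make the blocking categories concrete, here is a minimal Python sketch of the kind of FASTA-level checks the gate describes: duplicate IDs, invalid characters, records that appear before any header, and N content. It is an illustration only and not FastaGuard's implementation. The function name scan_fasta, the returned keys, and the choice to treat gap symbols as invalid are assumptions; no N-rate threshold is applied because the source does not state one.

```python
from collections import Counter

# IUPAC nucleotide codes (assumption: gap symbols such as "-" count as invalid here).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def scan_fasta(text: str) -> dict:
    """Scan FASTA text and report simple structural and composition signals.

    Hypothetical helper for illustration; not part of the FastaGuard CLI.
    """
    ids = []
    invalid_characters = set()
    n_count = 0
    total_bases = 0
    structure_ok = True

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header.
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
            continue
        if not ids:
            # Sequence data before the first header is a structural error.
            structure_ok = False
        for base in line.upper():
            total_bases += 1
            if base == "N":
                n_count += 1
            if base not in IUPAC_NUCLEOTIDES:
                invalid_characters.add(base)

    duplicate_ids = sorted(name for name, seen in Counter(ids).items() if seen > 1)
    return {
        "structure_ok": structure_ok,
        "duplicate_ids": duplicate_ids,
        "invalid_characters": sorted(invalid_characters),
        "n_rate": n_count / total_bases if total_bases else 0.0,
    }


if __name__ == "__main__":
    example = ">c1 first contig\nACGTNN\n>c2\nACXT\n>c1\nGG\n"
    print(scan_fasta(example))
```

In this sketch any non-empty duplicate_ids or invalid_characters list, or structure_ok being False, would correspond to a blocking category, while n_rate would still need a threshold from the tool's own configuration.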

Inspect the machine-readable contract:

fastaguard --schema
fastaguard --finding-catalog
fastaguard --explain-finding high_n_rate

Build and run the local Docker image:

docker build -t fastaguard:local .
docker run --rm -v "$PWD:/data" fastaguard:local /data/sample.fa \
  --profile assembly \
  --out /data/fastaguard_report.html \
  --json /data/fastaguard.json \
  --tsv /data/fastaguard.tsv \
  --multiqc /data/fastaguard_mqc.json

BioContainers currently publishes the v0.2.0 image, which does not yet include the v0.3 gate behavior:

docker pull quay.io/biocontainers/fastaguard:0.2.0--hfa8f182_0

Exit codes:

0 = pass
1 = warnings above configured threshold
2 = hard QC failure
3 = invalid input / tool error
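A workflow step can branch on these exit codes. The following Python sketch wraps the pipeline gate command from the Quickstart and maps the return code to the meanings listed above. The helper names interpret_exit_code and run_gate are hypothetical, and the policy that only exit code 0 lets the workflow continue is an assumption; a pipeline may reasonably choose to continue on 1.

```python
import subprocess

# Meanings copied from the exit-code list above.
EXIT_MEANINGS = {
    0: "pass",
    1: "warnings above configured threshold",
    2: "hard QC failure",
    3: "invalid input / tool error",
}


def interpret_exit_code(code: int) -> tuple[str, bool]:
    """Return (meaning, should_continue) for a FastaGuard exit code."""
    meaning = EXIT_MEANINGS.get(code, "unknown exit code")
    # Policy assumption: only a clean pass lets the workflow continue.
    return meaning, code == 0


def run_gate(fasta_path: str) -> tuple[str, bool]:
    """Run the Quickstart gate command and interpret its exit code.

    Requires fastaguard on PATH; check=False keeps non-zero exits from raising.
    """
    result = subprocess.run(
        ["fastaguard", fasta_path, "--profile", "assembly", "--gate", "pipeline"],
        check=False,
    )
    return interpret_exit_code(result.returncode)
```

Keeping the mapping separate from the subprocess call makes the stop/go policy easy to change and to test without the binary installed.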

Product Thesis

FASTA files are everywhere, but FASTA QC is fragmented across ad hoc scripts, seqkit stats, assembly QC tools, completeness tools, contamination workflows, and pipeline-specific checks. Each is useful, but none is the simple default first command that answers:

Is this FASTA file valid, sane, interpretable, and ready for downstream tools?

FastaGuard fills that gap:

FastaGuard is a fast, explainable FASTA QC tool that validates assembly FASTA files, detects structural and composition red flags, and produces pipeline-ready reports before expensive downstream analysis.

Assembly Scope

FastaGuard is assembly-first.

fastaguard sample.fa \
  --profile assembly \
  --gate pipeline \
  --out fastaguard_report.html \
  --json fastaguard.json \
  --tsv fastaguard.tsv \
  --multiqc fastaguard_mqc.json

The MVP focuses on:

  • FASTA validity
  • invalid FASTA structure reports with explainable FAIL verdicts
  • duplicate IDs
  • duplicate sequences
  • invalid nucleotide/IUPAC characters
  • empty records
  • core assembly stats
  • N50, N90, L50, L90
  • GC, AT, N, and ambiguity rates
  • high-N scaffolds
  • gap runs
  • suspicious tiny contigs
  • explainable PASS / WARN / FAIL verdicts
  • machine-readable summaries, actions, scope, and provenance
  • stable JSON, TSV, HTML, and MultiQC-compatible outputs
  • length histogram and GC-vs-length plot data in JSON and HTML
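For readers unfamiliar with the contiguity statistics in the list above, here is a small Python sketch of the usual N50/L50 definition: sort contigs from longest to shortest and accumulate lengths until the running sum reaches the requested percentage of the total. The contig length at that point is Nx and the number of contigs used is Lx. The function name nx_lx is hypothetical, and FastaGuard's exact conventions (for example, handling of ties or minimum contig length) are not stated in this README, so treat this as the textbook definition rather than the tool's code.

```python
def nx_lx(lengths, percent):
    """Return (Nx, Lx) for a list of contig lengths and a percentage such as 50 or 90."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length, count
    raise ValueError("percent must be between 0 and 100")


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total length 54
    print(nx_lx(contigs, 50))  # (8, 3): 10 + 9 + 8 = 27 reaches half of 54
    print(nx_lx(contigs, 90))  # (4, 7): 49 of 54 bases reaches 90%
```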

v0.2 expands the assembly preflight layer with:

  • composition outliers
  • richer provenance, taxonomy context, and routing hints
  • hardened MultiQC and pipeline adoption material

v0.3 adds the assembly gate contract:

  • --gate pipeline for default workflow blocking behavior
  • gate.blocking_findings for machine stop/go decisions
  • checksum provenance with provenance.input_sha256
  • explicit advisory findings for evidence that should route follow-up QC rather than stop a pipeline by default
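A downstream step could consume these two fields from the JSON report roughly as follows. This Python sketch assumes that gate.blocking_findings is a list and that provenance.input_sha256 is the hex SHA-256 digest of the input file's bytes as stored on disk; neither detail is confirmed by this README, so check the output of fastaguard --schema before relying on it. The helper names sha256_of_file and gate_decision are hypothetical.

```python
import hashlib
import json


def sha256_of_file(path: str) -> str:
    """Hash a file in 1 MiB chunks so large assemblies do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gate_decision(report: dict, fasta_path: str) -> dict:
    """Combine the recorded checksum and blocking findings into a stop/go decision."""
    # Assumed layout: {"gate": {"blocking_findings": [...]}, "provenance": {"input_sha256": "..."}}
    blocking = list(report.get("gate", {}).get("blocking_findings", []))
    recorded = report.get("provenance", {}).get("input_sha256")
    checksum_matches = (
        isinstance(recorded, str) and recorded.lower() == sha256_of_file(fasta_path)
    )
    return {
        "checksum_matches": checksum_matches,
        "blocking_findings": blocking,
        # Go only if the report describes this exact file and nothing blocks.
        "go": checksum_matches and not blocking,
    }


def load_report(json_path: str) -> dict:
    """Load a report written with --json."""
    with open(json_path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```

Checking the checksum first guards against acting on a report that was generated for a different or since-modified FASTA file.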

Positioning

FastaGuard should recommend deeper tools when they are appropriate:

  • QUAST for assembly quality evaluation
  • BUSCO for biological completeness
  • BlobToolKit for contamination and cobiont exploration
  • CheckM for microbial genome completeness and contamination
  • seqkit for ad hoc sequence operations

The strategic wedge is earlier:

FastaGuard catches FASTA-level assembly problems before expensive assembly QC.

Status

v0.2.0 is published on GitHub with Linux and macOS release binaries. Bioconda serves v0.2.0 for linux-64, linux-aarch64, osx-64, and osx-arm64. BioContainers also publishes the pinned workflow image quay.io/biocontainers/fastaguard:0.2.0--hfa8f182_0.

The current development milestone is v0.3: evidence, checksum provenance, and the assembly gate contract. Published Bioconda and BioContainers packages remain v0.2.0 until a v0.3 release is cut.
