easymsa-prep cleans FASTA input before EasyMSA alignment. It accepts a FASTA
file, compressed FASTA, directory, or archive, then writes:
- clean.fasta
- result.json
- summary.txt
No subcommands are needed.
Install build dependencies (Debian/Ubuntu):
sudo apt update
sudo apt install -y cmake g++ libomp-dev zlib1g-dev liblzma-dev
Build:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
Optional local install:
cmake --install build --prefix "$HOME/.local"
export PATH="$HOME/.local/bin:$PATH"
Check:
easymsa-prep --help
If you do not install it, use build/easymsa-prep in the commands below.
Run with defaults:
easymsa-prep input.fasta.gz -o result
The default is --mode audit --strictness normal: it normalizes sequence text,
renames unsafe IDs, removes empty/all-unknown records, and reports quality
outliers as warnings instead of deleting them. Use --mode filter when the
backend should automatically drop high-N, high-illegal-character,
length-outlier, low-complexity, and low-similarity records. Use --strictness strict
or --strictness lenient to change QC thresholds.
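The difference between reporting and dropping can be illustrated with a small sketch of two of the checks named above. This is not the tool's implementation: the nucleotide alphabet, the function names, and the warning label high_illegal_char_ratio are assumptions, and the threshold values are borrowed from the params example in this README rather than known defaults.

```python
# Assumption: nucleotide input using IUPAC codes; the real tool's alphabet may differ.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def qc_ratios(sequence: str) -> tuple[float, float]:
    """Return (n_ratio, illegal_char_ratio) for one sequence."""
    seq = sequence.upper()
    if not seq:
        return 0.0, 0.0
    n_count = seq.count("N")
    # Gap handling is ignored here, so "-" would count as an illegal character.
    illegal = sum(1 for ch in seq if ch not in IUPAC_NUCLEOTIDES)
    return n_count / len(seq), illegal / len(seq)


def audit_record(sequence: str,
                 max_n_ratio: float = 0.25,
                 max_illegal_char_ratio: float = 0.02) -> list[str]:
    """Return warning labels for one record; nothing is dropped (audit style)."""
    n_ratio, illegal_ratio = qc_ratios(sequence)
    warnings = []
    if n_ratio > max_n_ratio:
        warnings.append("high_n_ratio")
    if illegal_ratio > max_illegal_char_ratio:
        warnings.append("high_illegal_char_ratio")  # hypothetical label
    return warnings
```

In a filter-style run, a record with a non-empty warning list would be removed instead of reported.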
Overwrite an existing result directory:
easymsa-prep input.fasta.gz -o result --force
Use more threads:
easymsa-prep input.fasta.gz -o result --threads 8 --force
Run the bundled examples:
bash examples/run_examples.sh
Supported inputs:
- FASTA: .fasta, .fa, .fna, .ffn, .txt
- Compressed FASTA: .gz, .xz
- Archives: .zip, .tar.gz, .tgz, .tar.xz
- Directories containing supported FASTA files
Archive members are checked for path traversal, absolute paths, hidden/system files, nested archives, excessive file count, and excessive uncompressed size.
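For readers who want to pre-screen uploads before they reach easymsa-prep, the member checks listed above could look roughly like the following for zip archives. This is a sketch using Python's standard zipfile module, not the tool's own code; the function name, the two limit constants, and the problem messages are hypothetical.

```python
import zipfile
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical limits for illustration; the tool's real limits are not stated here.
MAX_MEMBERS = 10_000
MAX_TOTAL_UNCOMPRESSED = 2 * 1024**3  # bytes
ARCHIVE_SUFFIXES = (".zip", ".tar.gz", ".tgz", ".tar.xz")


def check_zip_members(archive_file) -> list[str]:
    """Return a list of problems found in a zip archive's member list.

    archive_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive_file) as archive:
        members = [m for m in archive.infolist() if not m.is_dir()]

    problems = []
    if len(members) > MAX_MEMBERS:
        problems.append(f"too many files: {len(members)}")

    total_size = 0
    for member in members:
        name = member.filename
        parts = PurePosixPath(name).parts
        if PurePosixPath(name).is_absolute() or PureWindowsPath(name).is_absolute():
            problems.append(f"absolute path: {name}")
        if ".." in parts:
            problems.append(f"path traversal: {name}")
        if "__MACOSX" in parts or any(p.startswith(".") and p != ".." for p in parts):
            problems.append(f"hidden or system file: {name}")
        if name.lower().endswith(ARCHIVE_SUFFIXES):
            problems.append(f"nested archive: {name}")
        total_size += member.file_size  # size declared in the archive header

    if total_size > MAX_TOTAL_UNCOMPRESSED:
        problems.append(f"uncompressed size too large: {total_size} bytes")
    return problems
```

The declared sizes come from the archive headers, so a careful extractor would also enforce the size limit while actually decompressing.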
Success:
result/
  clean.fasta
  result.json
  summary.txt
Failure:
result/
  result.json
  summary.txt
clean.fasta is not produced on failure.
Backend workers should pass only clean.fasta to the MSA stage and treat
result.json as the machine-readable source of truth.
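A backend worker following this contract might read the output directory as sketched below. The helper name collect_outputs is hypothetical, and the sketch deliberately makes no assumption about the keys inside result.json, since its schema is not described here.

```python
import json
from pathlib import Path


def collect_outputs(result_dir):
    """Return (report, clean_fasta_or_None) for an easymsa-prep output directory."""
    result_dir = Path(result_dir)
    # result.json is written on success and on failure, so read it first.
    report = json.loads((result_dir / "result.json").read_text(encoding="utf-8"))
    # clean.fasta exists only on success; hand it to the MSA stage only if present.
    clean = result_dir / "clean.fasta"
    return report, (clean if clean.is_file() else None)
```

A worker would then start the MSA stage only when the second value is not None, and store or inspect the report either way.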
easymsa-prep <input> -o <output_dir> [options]
- --mode audit|filter: default audit.
- --strictness strict|normal|lenient: default normal.
- --params params.json: override thresholds, rules, limits, or output settings.
- --threads N: OpenMP threads; default is min(OpenMP max threads, 8).
- --batch-size N: default 10000.
- --force: overwrite generated output files.
- --quiet: reduce terminal output.
- --version, --help.
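When the tool is called from a service rather than a shell, it can help to assemble the argument list from the documented options in one place. The following is a minimal sketch; build_command and its keyword names are hypothetical, and only flags listed above are emitted.

```python
def build_command(input_path, output_dir, *, mode="audit", strictness="normal",
                  params=None, threads=None, force=False, quiet=False,
                  binary="easymsa-prep"):
    """Build an argument list for easymsa-prep from the documented options."""
    if mode not in ("audit", "filter"):
        raise ValueError(f"unknown mode: {mode}")
    if strictness not in ("strict", "normal", "lenient"):
        raise ValueError(f"unknown strictness: {strictness}")

    cmd = [binary, str(input_path), "-o", str(output_dir),
           "--mode", mode, "--strictness", strictness]
    if params is not None:
        cmd += ["--params", str(params)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if force:
        cmd.append("--force")
    if quiet:
        cmd.append("--quiet")
    return cmd
```

Passing the arguments as a list avoids shell quoting problems with unusual file names. Set binary to build/easymsa-prep if the tool is not installed.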
Mode summary:
| Mode | Default behavior |
|---|---|
| audit | Check and report suspicious records without dropping them |
| filter | Check, remove records that exceed QC thresholds, and collapse exact duplicates |
Strictness summary:
| Strictness | Threshold behavior |
|---|---|
| strict | Tight thresholds for cleaner, more uniform inputs |
| normal | Default middle setting for most runs |
| lenient | Wider thresholds for noisy or diverse inputs |
Use --params to customize filtering:
easymsa-prep input.fasta.gz -o result --params params.json --force
Minimal example:
{
  "mode": "filter",
  "strictness": "normal",
  "thresholds": {
    "max_n_ratio": 0.25,
    "max_illegal_char_ratio": 0.02
  },
  "rules": {
    "gaps": "remove",
    "illegal_characters": "replace_with_N",
    "high_n_ratio": "remove",
    "too_short": "remove",
    "too_long": "warn_only",
    "identical_sequences": "collapse"
  }
}
Detailed documentation:
Exit codes:
- 0: success; all three output files exist.
- 1: input, params, archive, FASTA, limit, or runtime error.
- 2: command-line error.
When the output directory is known, a failed run still writes result.json and summary.txt.
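A caller can branch on the process exit status before looking at any files. The sketch below uses Python's standard subprocess module; the names EXIT_MEANINGS, describe_exit, and run_prep are hypothetical, and the wording of each meaning simply mirrors the exit codes documented in this README.

```python
import subprocess

# Meanings as documented in this README; any other value is treated as unexpected.
EXIT_MEANINGS = {
    0: "success",
    1: "input, params, archive, FASTA, limit, or runtime error",
    2: "command-line error",
}


def describe_exit(code: int) -> str:
    """Map an exit code to a short human-readable description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_prep(cmd: list[str]) -> tuple[bool, str]:
    """Run a command given as an argument list and return (ok, description)."""
    completed = subprocess.run(cmd, check=False)
    return completed.returncode == 0, describe_exit(completed.returncode)
```

On a non-zero code the worker can still read result.json and summary.txt from the output directory for details, provided the directory was known to the tool.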
Build and test:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure
Source files are intentionally flat under src/:
- cli.cpp: command line parsing
- config.cpp: mode, strictness, and params JSON
- input.cpp: discovery, archive extraction, FASTA reading
- engine.cpp: QC, cleaning, reference selection, dedup
- reports.cpp: result.json and summary.txt
- sequence_utils.cpp, common.cpp: shared helpers
Vendored dependencies: CLI11, miniz, doctest, kseq.h, json.hpp,
xxhash. System dependencies: OpenMP, zlib, liblzma.