mapper_v3.py

Maps insertion flanking sequences from a FASTA/FASTQ file to a target genome (GenBank format) using exact k-mer matching. For each read, the script locates a user-defined flanking sequence at the edge of an insertion element, extracts the adjacent genomic k-mer, and maps it to the target genome. Optionally filters out reads that map to a donor plasmid, merges adjacent insertion calls caused by sequencing errors, and outputs insertion site maps, distance plots, and sequence logos.

Usage

python mapper_v3.py --fasta_file <reads> --genome <genome.gb> --flanking_seq <sequence> [options]

Minimal example:

python mapper_v3.py \
  --fasta_file raw_data/sample.fastq \
  --genome genomes/MG1655.gb \
  --flanking_seq gttgtattatcctagggata \
  --output my_sample

Arguments

Required

Argument	Short	Description
`--fasta_file`	`-f`	Path to input FASTA or FASTQ file of raw insertion mapping reads
`--genome`	`-g`	Path to target genome GenBank (.gb) file
`--flanking_seq`	`-s`	Sequence at the edge of the insertion element (20–25 bp recommended). Must contain only A/T/C/G

Optional

Argument	Short	Default	Description
`--donor`	`-d`	None	Path to donor plasmid GenBank file. Reads that map to this plasmid are separated out to identify donor contamination
`--insertion_end`	`-e`	`left_end`	Which end of the insertion to analyze. Choices: `left_end`, `right_end`
`--mapping_kmer`	`-k`	`25`	K-mer length (bp) used for genome mapping. This many bp of genomic sequence adjacent to the insertion end must exactly match for a mapping call
`--context`	`-l`	`7`	Length of sequence extracted on each side of the insertion coordinate for sequence logos and Needleman–Wunsch similarity plots. Keep at `7` for IS621 (14 bp context window); changing this may produce unpredictable results
`--min_mean_qual`	`-q`	`10`	Minimum mean Phred quality score for FASTQ reads to be retained. Ignored for FASTA input
`--merge_adjacent`	`-m`	`5`	Window size (bp) for merging nearby insertion coordinates. Corrects for sequencing errors that shift the insertion coordinate by a few bp. Set to `0` to disable
`--output`	`-o`	`mapping_results`	Prefix for all output files
`--plot`	`-p`	`True`	Generate insertion maps, distance plots, and sequence logos
`--highlight`	`-t`	None	Comma-separated list of sequences to highlight on insertion and distance plots. Example: `ACGT,TTGG`
`--classify`	`-c`	`Target-like=AGCCGCGGTAATAC,` `Donor-like=ACAGTATGTTGTAT`	Classify insertion sites by sequence motif. Format: `label1=seq1,label2=seq2`. Sequences should be 2× context length (14 bp by default)
`--classify_limit`	`-b`	`inf`	Maximum Levenshtein distance allowed when classifying insertion sites by motif. Sites exceeding this threshold are placed in an unclassified bin
`--read_cutoff`	`-r`	`0`	Exclude insertion sites below this percent abundance from sequence logo plots
`--verbose`	`-v`	`1`	Logging verbosity. Use `-v` for major pipeline steps, `-vv` for detailed debug output

Output Files

All output files are prefixed with the value of --output.

File	Description
`{output}_flanks.csv`	All flanking sequences extracted from reads before mapping
`{output}_genome_unique_mapped_reads.csv`	All reads that mapped to exactly one genome location
`{output}_genome_non-unique_mapped_reads.csv`	Reads that mapped to multiple genome locations (excluded from analysis)
`{output}_donorplasmid_unique_mapped_reads.csv`	Reads that mapped to the donor plasmid (only if `--donor` is provided)
`{output}_contigs.csv`	Genome contig names and sizes
`{output}_genome_unique_mapped_reads_counted.csv`	Final tabulated and merged insertion sites with read counts
`{output}_donorplasmid_unique_mapped_reads_counted.csv`	Tabulated donor plasmid insertion sites (only if `--donor` is provided)
`{output}_log.log`	Run log
Plots	Insertion maps, distance/PCA plots, and sequence logos (if `--plot True`)

Dependencies

Python 3
Biopython
pandas
matplotlib
numpy
logomaker
python-Levenshtein
scikit-learn
umap-learn (optional — only required if using UMAP dimensionality reduction)
MAFFT (external tool — must be installed and available on $PATH)
plotter_v3.py (must be in the same directory)

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
README.md		README.md
mapper_v3.py		mapper_v3.py
plotter_v3.py		plotter_v3.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mapper_v3.py

Usage

Arguments

Required

Optional

Output Files

Dependencies

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

mapper_v3.py

Usage

Arguments

Required

Optional

Output Files

Dependencies

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages