Commit a2a152f

add snpeff step
1 parent 4a5029a commit a2a152f

19 files changed

Lines changed: 121 additions & 84 deletions

README.md

Lines changed: 59 additions & 14 deletions
@@ -1,23 +1,37 @@
1-
# TronFlow variant normalization pipeline
1+
# TronFlow VCF postprocessing
22

33
![GitHub tag (latest SemVer)](https://img.shields.io/github/v/release/tron-bioinformatics/tronflow-variant-normalization?sort=semver)
44
[![Run tests](https://github.com/TRON-Bioinformatics/tronflow-variant-normalization/actions/workflows/automated_tests.yml/badge.svg?branch=master)](https://github.com/TRON-Bioinformatics/tronflow-variant-normalization/actions/workflows/automated_tests.yml)
55
[![DOI](https://zenodo.org/badge/372133189.svg)](https://zenodo.org/badge/latestdoi/372133189)
66
[![License](https://img.shields.io/badge/license-MIT-green)](https://opensource.org/licenses/MIT)
77
[![Powered by Nextflow](https://img.shields.io/badge/powered%20by-Nextflow-orange.svg?style=flat&colorA=E1523D&colorB=007D8A)](https://www.nextflow.io/)
88

9-
The TronFlow variant normalization pipeline is part of a collection of computational workflows for tumor-normal pair
9+
The TronFlow VCF postprocessing pipeline is part of a collection of computational workflows for tumor-normal pair
1010
somatic variant calling.
11+
These workflows are implemented in the Nextflow (Di Tommaso, 2017) framework.
1112

1213
Find the documentation here [![Documentation Status](https://readthedocs.org/projects/tronflow-docs/badge/?version=latest)](https://tronflow-docs.readthedocs.io/en/latest/?badge=latest)
13-
1414

15-
This pipeline aims at normalizing variants represented in a VCF into the convened normal form as described in Tan 2015.
16-
The variant normalization is based on the implementation in vt (Tan 2015) and bcftools (Danecek 2021).
17-
The pipeline is implemented on the Nextflow (Di Tommaso 2017) framework.
15+
This pipeline has several objectives:
16+
* Variant filtering
17+
* Variant normalization
18+
* Technical annotations from different BAM files
19+
* Functional annotations
20+
21+
## Variant filtering
22+
23+
Optionally, only variants with the value in the column `FILTER` matching the value of parameter `--filter` are kept.
24+
If this parameter is not used, no variants are filtered out. Multiple values can be passed separated by commas without spaces.
25+
26+
For instance, `--filter PASS,.` will keep variants having `FILTER=PASS` or `FILTER=.`, but remove all others.
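
The filtering rule above can be sketched in a few lines of Python. This is only an illustration and not the pipeline's implementation: the helper names `parse_filter_param` and `keep_variant` are hypothetical, the records are made up, and it assumes an exact string match on the `FILTER` column.

```python
def parse_filter_param(value):
    """Turn a value such as "PASS,." into a set; None means no filtering."""
    return set(value.split(",")) if value else None


def keep_variant(vcf_line, allowed_filters):
    """Return True if the record's FILTER column (7th field) is allowed."""
    if allowed_filters is None:
        return True  # parameter not used: nothing is filtered out
    filter_value = vcf_line.rstrip("\n").split("\t")[6]
    return filter_value in allowed_filters


# Hypothetical records: CHROM POS ID REF ALT QUAL FILTER INFO
records = [
    "chr1\t100\t.\tA\tC\t.\tPASS\t.",
    "chr1\t200\t.\tG\tT\t.\t.\t.",
    "chr1\t300\t.\tT\tA\t.\tLowQual\t.",
]

allowed = parse_filter_param("PASS,.")
kept = [r for r in records if keep_variant(r, allowed)]
print(len(kept))  # 2: the PASS and "." records are kept
```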
27+
28+
29+
## Variant normalization
30+
31+
The normalization step aims to represent variants in the conventional normal form described in Tan 2015.
32+
The variant normalization is based on the implementation in vt (Tan, 2015) and bcftools (Danecek, 2021).
1833

19-
The pipeline consists of the following steps:
20-
* Variant filtering (optional)
34+
The normalization pipeline consists of the following steps:
2135
* Decomposition of MNPs into atomic variants (ie: AC > TG is decomposed into two variants A>T and C>G) (optional).
2236
* Decomposition of multiallelic variants into biallelic variants (ie: A > C,G is decomposed into two variants A > C and A > G)
2337
* Trim redundant sequence and left align indels, indels in repetitive sequences can have multiple representations
@@ -34,9 +48,9 @@ The output consists of:
3448
![Pipeline](images/variant_normalization_pipeline.png)
3549

3650

37-
## What is variant normalization?
51+
### What is variant normalization?
3852

39-
### Variants are trimmed removing redundant bases
53+
#### Variants are trimmed removing redundant bases
4054

4155
Before:
4256
```
@@ -49,7 +63,7 @@ chr1 13085 . A C . UNTRIMMED OLD_CLUMPED=chr1|13083|AGA|AGC GT:AD 0:276,276 0/1:
4963

5064
**NOTE**: when a variant is changed during normalization the original variant is kept in the `INFO` field `OLD_CLUMPED`.
5165

52-
### Indels are left aligned
66+
#### Indels are left aligned
5367

5468
If an indel lies within a repetitive sequence it can be reported at different positions; the convention is to report
5569
the leftmost indel. This is what is called left alignment.
@@ -69,7 +83,7 @@ chr1 13140 . CCTGAG C . UNALIGNED OLD_CLUMPED=chr1|13141|CTGAGG|G GT:AD 0:80,1 0
6983
**NOTE**: the rule of thumb to confirm if any given indel is left aligned, assuming that they are trimmed, is that the last base in both reference and
7084
alternate must differ.
7185

72-
### Multi-allelic variants are split.
86+
#### Multi-allelic variants are split.
7387

7488
Before:
7589
```
@@ -84,7 +98,7 @@ chr1 13204 . C T . MULTIALLELIC OLD_VARIANT=chr1|13204|C|G,|2 GT:AD 0:229,229 0/
8498
Note that the AD values are incorrectly set after the split; this seems to be a regression issue in bcftools v1.12
8599
and has been reported here https://github.com/samtools/bcftools/issues/1499.
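
To make the expected behaviour concrete, here is a minimal Python sketch of how allele depths would be distributed when a multiallelic site is split, assuming `AD` is ordered as reference first and then one value per alternate allele. The function name `split_multiallelic` and the depth values are hypothetical; this is not how bcftools implements the split.

```python
def split_multiallelic(pos, ref, alts, ad):
    """Split one multiallelic site into biallelic records.

    `ad` holds allele depths ordered as [ref, alt1, alt2, ...]. Each output
    record keeps the reference depth and the depth of its own alternate.
    """
    if len(ad) != len(alts) + 1:
        raise ValueError("AD must have one value per allele, reference included")
    return [
        {"pos": pos, "ref": ref, "alt": alt, "ad": [ad[0], ad[i + 1]]}
        for i, alt in enumerate(alts)
    ]


# Hypothetical depths for a C > G,T site
for record in split_multiallelic(13204, "C", ["G", "T"], [229, 10, 5]):
    print(record)
```

With these made-up depths, the `C>G` record carries `AD=229,10` and the `C>T` record carries `AD=229,5`.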
86100

87-
### Multi Nucleotide Variants (MNVs) can be decomposed
101+
#### Multi Nucleotide Variants (MNVs) can be decomposed
88102

89103
Before:
90104
```
@@ -101,7 +115,7 @@ chr1 13266 . T C . MNV OLD_CLUMPED=chr1:13261:GCTCCT/CCCCCC GT:AD:PS 0:41,0:1326
101115
This behaviour is optional and can be disabled with `--skip_decompose_complex`. Note that the phase is maintained in
102116
the fields `FORMAT/GT` and `FORMAT/PS` as described in the VCF specification section 1.4.2.
103117

104-
### Complex variants combining SNV and indels can be decomposed
118+
#### Complex variants combining SNV and indels can be decomposed
105119

106120
Before:
107121
```
@@ -118,6 +132,36 @@ chr1 13325 . CT C . MNV-INDEL OLD_CLUMPED=chr1:13321:AGCCCT/CGCC GT:AD:PS 0:229,
118132
As with MNVs, this behaviour can be disabled with `--skip_decompose_complex`.
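
For the simpler equal-length (MNV) case, the decomposition amounts to comparing reference and alternate base by base. The Python sketch below illustrates this with the `GCTCCT/CCCCCC` variant from the MNV example. The function name `decompose_mnv` is hypothetical, phase information (`GT`, `PS`) is not tracked, and complex variants mixing SNVs and indels need an alignment step that is left out here.

```python
def decompose_mnv(chrom, pos, ref, alt):
    """Decompose an equal-length MNV into SNVs at their own positions.

    Positions where reference and alternate carry the same base are skipped.
    """
    if len(ref) != len(alt):
        raise ValueError("only equal-length substitutions are handled here")
    return [
        (chrom, pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]


for snv in decompose_mnv("chr1", 13261, "GCTCCT", "CCCCCC"):
    print(snv)
# ('chr1', 13261, 'G', 'C')
# ('chr1', 13263, 'T', 'C')
# ('chr1', 13266, 'T', 'C')
```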
119133

120134

135+
## Technical annotations
136+
137+
The technical annotations provide insight into the variant calling process by looking into the context of each variant
138+
within the pileup of a BAM file. When doing somatic variant calling it may be relevant to have technical annotations
139+
for the same variant in a patient from multiple BAM files.
140+
These annotations are provided by VAFator (https://github.com/TRON-Bioinformatics/vafator).
141+
142+
## Functional annotations
143+
144+
The functional annotations provide a biological context for every variant, such as the overlapping genes or the effect
145+
of the variant in a protein. These annotations are provided by SnpEff (Cingolani, 2012).
146+
147+
The SnpEff available human annotations are:
148+
* GRCh37.75
149+
* GRCh38.99
150+
* hg19
151+
* hg19kg
152+
* hg38
153+
* hg38kg
154+
155+
Before running the functional annotations you will need to download the SnpEff database for the reference genome you want to use.
156+
This can be done as follows: `snpEff download -dataDir /your/snpeff/folder -v hg19`
157+
158+
When running the pipeline, indicate the reference genome with `--snpeff_organism`, for example `--snpeff_organism hg19`.
159+
If none is provided, no SnpEff annotations will be added.
160+
Provide the snpEff folder with `--snpeff_datadir`.
161+
To provide any additional SnpEff arguments use `--snpeff_args` such as
162+
`--snpeff_args "-noStats -no-downstream -no-upstream -no-intergenic -no-intron -onlyProtein -hgvs1LetterAa -noShiftHgvs"`,
163+
otherwise defaults will be used.
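
How the three SnpEff parameters combine can be sketched as follows. This Python function only mirrors the command line assembled in `modules/05_variant_annotation.nf` and the check added to `main.nf` in this commit; the name `build_snpeff_command` and the sample file names are hypothetical, and nothing is executed.

```python
import shlex


def build_snpeff_command(name, vcf, organism, datadir=None, extra_args=""):
    """Assemble the snpEff command line for one VCF, or None if skipped."""
    if not organism:
        return None  # no --snpeff_organism: the annotation step does not run
    if not datadir:
        raise ValueError("provide your snpEff data folder with --snpeff_datadir")
    parts = ["snpEff", "eff", "-dataDir", datadir]
    parts += shlex.split(extra_args)  # optional --snpeff_args
    parts += ["-nodownload", organism, vcf]
    command = " ".join(shlex.quote(p) for p in parts)
    return command + " > " + shlex.quote(f"{name}.annotated.vcf")


print(build_snpeff_command(
    "sample1", "sample1.vcf", "hg19",
    datadir="/your/snpeff/folder",
    extra_args="-noStats -hgvs1LetterAa",
))
```

Because `-nodownload` is always passed, the database has to be present in the data folder before the pipeline runs.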
164+
121165

122166
## How to run it
123167

@@ -194,3 +238,4 @@ The aggregated vafator annotations on each sample will also be provided without
194238
* Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. [Unified Representation of Genetic Variants. Bioinformatics (2015) 31(13): 2202-2204](http://bioinformatics.oxfordjournals.org/content/31/13/2202)
* Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics (Oxford, England), 27(21), 2987–2993. 10.1093/bioinformatics/btr509
195239
* Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. Gigascience. 2021 Feb 16;10(2):giab008. doi: 10.1093/gigascience/giab008. PMID: 33590861; PMCID: PMC7931819.
196240
* Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319. 10.1038/nbt.3820
241+
* Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672

bin/filter_passed.sh

Lines changed: 0 additions & 5 deletions
This file was deleted.

bin/normalization.sh

Lines changed: 0 additions & 47 deletions
This file was deleted.

bin/summary.sh

Lines changed: 0 additions & 7 deletions
This file was deleted.

main.nf

Lines changed: 17 additions & 4 deletions
@@ -2,10 +2,11 @@
22

33
nextflow.enable.dsl = 2
44

5-
include { BCFTOOLS_NORM; VT_DECOMPOSE_COMPLEX; REMOVE_DUPLICATES } from './modules/normalization'
6-
include { FILTER_VCF } from './modules/filter'
7-
include { SUMMARY_VCF; SUMMARY_VCF as SUMMARY_VCF_2 } from './modules/summary'
8-
include { VAFATOR; MULTIALLELIC_FILTER } from './modules/vafator'
5+
include { FILTER_VCF } from './modules/01_filter'
6+
include { BCFTOOLS_NORM; VT_DECOMPOSE_COMPLEX; REMOVE_DUPLICATES } from './modules/02_normalization'
7+
include { SUMMARY_VCF; SUMMARY_VCF as SUMMARY_VCF_2 } from './modules/03_summary'
8+
include { VAFATOR; MULTIALLELIC_FILTER } from './modules/04_vafator'
9+
include { VARIANT_ANNOTATION } from './modules/05_variant_annotation'
910

1011
params.help= false
1112
params.input_vcfs = false
@@ -21,13 +22,20 @@ params.vcf_without_ad = false
2122
params.mapping_quality = false
2223
params.base_call_quality = false
2324
params.skip_multiallelic_filter = false
25+
params.snpeff_organism = false
26+
params.snpeff_args = ""
27+
params.snpeff_datadir = false
2428

2529

2630
if (params.help) {
2731
log.info params.help_message
2832
exit 0
2933
}
3034

35+
if ( params.snpeff_organism && ! params.snpeff_datadir) {
36+
exit 1, "To run snpEff, please, provide your snpEff data folder with --snpeff_datadir"
37+
}
38+
3139
if (! params.input_vcfs && ! params.input_vcf) {
3240
exit 1, "Neither --input_vcfs nor --input_vcf is provided!"
3341
}
@@ -81,5 +89,10 @@ workflow {
8189
final_vcfs = MULTIALLELIC_FILTER.out.filtered_vcf
8290
}
8391
}
92+
93+
if (params.snpeff_organism) {
94+
VARIANT_ANNOTATION(final_vcfs)
95+
final_vcfs = VARIANT_ANNOTATION.out.annotated_vcf
96+
}
8497
}
8598

File renamed without changes.

modules/05_variant_annotation.nf

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
params.memory = "3g"
2+
params.cpus = 1
3+
params.output = "."
4+
params.snpeff_datadir = false
5+
params.snpeff_organism = false
6+
params.snpeff_args = ""
7+
8+
9+
process VARIANT_ANNOTATION {
10+
cpus params.cpus
11+
memory params.memory
12+
publishDir "${params.output}/${name}", mode: "copy"
13+
14+
conda (params.enable_conda ? "bioconda::snpeff=5.0" : null)
15+
16+
input:
17+
tuple val(name), file(vcf)
18+
19+
output:
20+
tuple val(name), file("${name}.annotated.vcf") , emit: annotated_vcf
21+
22+
script:
23+
datadir_arg = params.snpeff_datadir ? "-dataDir ${params.snpeff_datadir}" : ""
24+
"""
25+
snpEff eff ${datadir_arg} ${params.snpeff_args} -nodownload ${params.snpeff_organism} ${vcf} > ${name}.annotated.vcf
26+
"""
27+
}
