Commit e6c107d

Merge pull request #3 from TRON-Bioinformatics/update-vafator
Update vafator to v1.2.0 + add SnpEff step
2 parents d423297 + a2bd6a6 commit e6c107d

20 files changed

Lines changed: 165 additions & 117 deletions

README.md

Lines changed: 76 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -1,23 +1,37 @@
1-
# TronFlow variant normalization pipeline
1+
# TronFlow VCF postprocessing
22

33
![GitHub tag (latest SemVer)](https://img.shields.io/github/v/release/tron-bioinformatics/tronflow-variant-normalization?sort=semver)
44
[![Run tests](https://github.com/TRON-Bioinformatics/tronflow-variant-normalization/actions/workflows/automated_tests.yml/badge.svg?branch=master)](https://github.com/TRON-Bioinformatics/tronflow-variant-normalization/actions/workflows/automated_tests.yml)
55
[![DOI](https://zenodo.org/badge/372133189.svg)](https://zenodo.org/badge/latestdoi/372133189)
66
[![License](https://img.shields.io/badge/license-MIT-green)](https://opensource.org/licenses/MIT)
77
[![Powered by Nextflow](https://img.shields.io/badge/powered%20by-Nextflow-orange.svg?style=flat&colorA=E1523D&colorB=007D8A)](https://www.nextflow.io/)
88

9-
The TronFlow variant normalization pipeline is part of a collection of computational workflows for tumor-normal pair
9+
The TronFlow VCF postprocessing pipeline is part of a collection of computational workflows for tumor-normal pair
1010
somatic variant calling.
11+
These workflows are implemented in the Nextflow (Di Tommaso, 2017) framework.
1112

1213
Find the documentation here [![Documentation Status](https://readthedocs.org/projects/tronflow-docs/badge/?version=latest)](https://tronflow-docs.readthedocs.io/en/latest/?badge=latest)
13-
1414

15-
This pipeline aims at normalizing variants represented in a VCF into the convened normal form as described in Tan 2015.
16-
The variant normalization is based on the implementation in vt (Tan 2015) and bcftools (Danecek 2021).
17-
The pipeline is implemented on the Nextflow (Di Tommaso 2017) framework.
15+
This pipeline has several objectives:
16+
* Variant filtering
17+
* Variant normalization
18+
* Technical annotations from different BAM files
19+
* Functional annotations
20+
21+
## Variant filtering
22+
23+
Optionally, only variants whose `FILTER` column matches the value of the parameter `--filter` are kept.
24+
If this parameter is not used, no variants are filtered out. Multiple values can be passed separated by commas without spaces.
25+
26+
For instance, `--filter PASS,.` will keep variants having `FILTER=PASS` or `FILTER=.`, but remove all others.
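
The filtering rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the pipeline's implementation: the function names are hypothetical, and the sketch assumes that the whole `FILTER` column is compared for an exact match against each comma-separated value.

```python
def parse_filter_param(value):
    """Split the comma-separated --filter value into a set of accepted FILTER values."""
    return set(value.split(","))


def filter_vcf_lines(lines, accepted):
    """Yield header lines unchanged and only the records whose FILTER column is accepted.

    Assumption: the FILTER column is matched as a whole string.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th column of a VCF record
        if fields[6] in accepted:
            yield line


# Toy VCF content (hypothetical records)
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tC\t.\tPASS\t.",
    "chr1\t200\t.\tG\tT\t.\t.\t.",
    "chr1\t300\t.\tT\tA\t.\tLowQual\t.",
]
kept = list(filter_vcf_lines(vcf, parse_filter_param("PASS,.")))
```

With `--filter PASS,.` the header and the first two records are kept, while the `LowQual` record is dropped.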
27+
28+
29+
## Variant normalization
30+
31+
The normalization step aims to represent variants in the conventional normal form described in Tan, 2015.
32+
The variant normalization is based on the implementation in vt (Tan, 2015) and bcftools (Danecek, 2021).
1833

19-
The pipeline consists of the following steps:
20-
* Variant filtering (optional)
34+
The normalization pipeline consists of the following steps:
2135
* Decomposition of MNPs into atomic variants (e.g. AC > TG is decomposed into two variants A>T and C>G) (optional).
2236
* Decomposition of multiallelic variants into biallelic variants (e.g. A > C,G is decomposed into two variants A > C and A > G)
2337
* Trim redundant sequence and left align indels; indels in repetitive sequences can have multiple representations
@@ -34,9 +48,9 @@ The output consists of:
3448
![Pipeline](images/variant_normalization_pipeline.png)
3549

3650

37-
## What is variant normalization?
51+
### What is variant normalization?
3852

39-
### Variants are trimmed removing redundant bases
53+
#### Variants are trimmed removing redundant bases
4054

4155
Before:
4256
```
@@ -49,7 +63,7 @@ chr1 13085 . A C . UNTRIMMED OLD_CLUMPED=chr1|13083|AGA|AGC GT:AD 0:276,276 0/1:
4963

5064
**NOTE**: when a variant is changed during normalization the original variant is kept in the `INFO` field `OLD_CLUMPED`.
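
As an illustration of the trimming step, here is a minimal Python sketch. It is not the bcftools or vt implementation; the function name `trim_variant` is hypothetical, and the sketch only removes shared bases while keeping at least one base in each allele. It does not left align.

```python
def trim_variant(pos, ref, alt):
    """Remove bases shared by REF and ALT, keeping at least one base in each allele.

    Returns the (possibly shifted) position and the trimmed alleles.
    """
    # Trim shared trailing bases first; the position does not change
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim shared leading bases; each removed base shifts the position by one
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The untrimmed variant from the example above: chr1|13083|AGA|AGC
trimmed = trim_variant(13083, "AGA", "AGC")
```

For the example above, the two shared leading bases `AG` are removed, which gives position 13085 with `A` > `C`.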
5165

52-
### Indels are left aligned
66+
#### Indels are left aligned
5367

5468
If an indel lies within a repetitive sequence it can be reported at different positions; the convention is to report
5569
the leftmost indel. This is called left alignment.
@@ -69,7 +83,7 @@ chr1 13140 . CCTGAG C . UNALIGNED OLD_CLUMPED=chr1|13141|CTGAGG|G GT:AD 0:80,1 0
6983
**NOTE**: as a rule of thumb, a trimmed indel is left aligned if the last base of the reference allele and the last base of the
7084
alternate allele differ.
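
The rule of thumb can be written as a one-line check. This is a sketch with a hypothetical function name; it assumes the alleles are already trimmed and it is only a quick sanity check, not a replacement for normalization against the reference genome.

```python
def looks_left_aligned(ref, alt):
    """Rule of thumb for trimmed indels: the last bases of REF and ALT must differ."""
    return ref[-1] != alt[-1]


# Alleles taken from the example above
after = looks_left_aligned("CCTGAG", "C")   # normalized record at chr1:13140
before = looks_left_aligned("CTGAGG", "G")  # original record at chr1:13141
```

The normalized record `CCTGAG` > `C` passes the check, whereas the original `CTGAGG` > `G` ends with the same base in both alleles and so does not.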
7185

72-
### Multi-allelic variants are split.
86+
#### Multi-allelic variants are split.
7387

7488
Before:
7589
```
@@ -84,7 +98,7 @@ chr1 13204 . C T . MULTIALLELIC OLD_VARIANT=chr1|13204|C|G,|2 GT:AD 0:229,229 0/
8498
Note that the AD values are incorrectly set after the split; this seems to be a regression in bcftools v1.12
8599
and has been reported here https://github.com/samtools/bcftools/issues/1499.
86100

87-
### Multi Nucleotide Variants (MNVs) can be decomposed
101+
#### Multi Nucleotide Variants (MNVs) can be decomposed
88102

89103
Before:
90104
```
@@ -101,7 +115,7 @@ chr1 13266 . T C . MNV OLD_CLUMPED=chr1:13261:GCTCCT/CCCCCC GT:AD:PS 0:41,0:1326
101115
This behaviour is optional and can be disabled with `--skip_decompose_complex`. Note that the phase is maintained in
102116
the fields `FORMAT/GT` and `FORMAT/PS` as described in the VCF specification section 1.4.2.
103117

104-
### Complex variants combining SNV and indels can be decomposed
118+
#### Complex variants combining SNV and indels can be decomposed
105119

106120
Before:
107121
```
@@ -118,6 +132,36 @@ chr1 13325 . CT C . MNV-INDEL OLD_CLUMPED=chr1:13321:AGCCCT/CGCC GT:AD:PS 0:229,
118132
As with MNVs, this behaviour can be disabled with `--skip_decompose_complex`.
119133

120134

135+
## Technical annotations
136+
137+
The technical annotations provide insight into the variant calling process by looking into the context of each variant
138+
within the pileup of a BAM file. When doing somatic variant calling it may be relevant to have technical annotations
139+
for the same variant in a patient from multiple BAM files.
140+
These annotations are provided by VAFator (https://github.com/TRON-Bioinformatics/vafator).
141+
142+
## Functional annotations
143+
144+
The functional annotations provide a biological context for every variant, such as the overlapping genes or the effect
145+
of the variant on a protein. These annotations are provided by SnpEff (Cingolani, 2012).
146+
147+
The SnpEff available human annotations are:
148+
* GRCh37.75
149+
* GRCh38.99
150+
* hg19
151+
* hg19kg
152+
* hg38
153+
* hg38kg
154+
155+
Before running the functional annotations, you will need to download the SnpEff annotations for the reference genome you intend to use.
156+
This can be done as follows: `snpEff download -dataDir /your/snpeff/folder -v hg19`
157+
158+
When running the pipeline, indicate the reference genome with `--snpeff_organism`, for instance `--snpeff_organism hg19`.
159+
If none is provided, no SnpEff annotations will be added.
160+
Provide the SnpEff data folder with `--snpeff_datadir`.
161+
To provide any additional SnpEff arguments, use `--snpeff_args`, such as
162+
`--snpeff_args "-noStats -no-downstream -no-upstream -no-intergenic -no-intron -onlyProtein -hgvs1LetterAa -noShiftHgvs"`,
163+
otherwise defaults will be used.
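
To make the role of the three parameters concrete, the following Python sketch assembles a SnpEff command line in the same order as the `VARIANT_ANNOTATION` process in `modules/05_variant_annotation.nf`. The function name and the example file names are hypothetical; the actual command is built by Nextflow, not by this code.

```python
import shlex


def build_snpeff_command(organism, vcf, output, datadir=None, extra_args=""):
    """Assemble the snpEff call from the pipeline parameters.

    organism   -> --snpeff_organism
    datadir    -> --snpeff_datadir
    extra_args -> --snpeff_args
    """
    cmd = ["snpEff", "eff"]
    if datadir:
        cmd += ["-dataDir", datadir]
    # Additional arguments are passed through as given
    cmd += shlex.split(extra_args)
    # -nodownload: the database must already be present in the data folder
    cmd += ["-nodownload", organism, vcf]
    return " ".join(shlex.quote(c) for c in cmd) + " > " + shlex.quote(output)


command = build_snpeff_command(
    organism="hg19",
    vcf="patient_1.vcf",
    output="patient_1.annotated.vcf",
    datadir="/your/snpeff/folder",
    extra_args="-noStats -no-downstream",
)
```

Because the command includes `-nodownload`, the annotations for the chosen organism have to be downloaded into the data folder beforehand.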
164+
121165

122166
## How to run it
123167

@@ -167,22 +211,31 @@ Output:
167211

168212
The table with VCF files expects two tab-separated columns without a header.
169213

170-
| Sample name | VCF |
214+
| Patient name | VCF |
171215
|----------------------|------------------------------------------------------------------------|
172-
| sample_1 | /path/to/sample_1.vcf |
173-
| sample_2 | /path/to/sample_2.vcf |
216+
| patient_1 | /path/to/patient_1.vcf |
217+
| patient_2 | /path/to/patient_2.vcf |
174218

175-
The optional table with BAM files expects three tab-separated columns without a header. Multiple comma-separated BAMs can be provided.
219+
The optional table with BAM files expects two tab-separated columns without a header.
176220

177-
| Sample name | Tumor BAMs | Normal BAMs |
178-
|----------------------|---------------------------------|------------------------------|
179-
| sample_1 | /path/to/sample_1.tumor_1.bam,/path/to/sample_1.tumor_2.bam | /path/to/sample_1.normal_1.bam,/path/to/sample_1.normal_2.bam |
180-
| sample_2 | /path/to/sample_2.tumor.bam | /path/to/sample_2.normal.bam |
221+
| Patient name | Sample name:BAM |
222+
|----------------------|---------------------------------|
223+
| patient_1 | primary_tumor:/path/to/sample_1.primary.bam |
224+
| patient_1 | metastasis_tumor:/path/to/sample_1.metastasis.bam |
225+
| patient_1 | normal:/path/to/sample_1.normal.bam |
226+
| patient_2 | primary_tumor:/path/to/sample_1.primary_1.bam |
227+
| patient_2 | primary_tumor:/path/to/sample_1.primary_2.bam |
228+
| patient_2 | metastasis_tumor:/path/to/sample_1.metastasis.bam |
229+
| patient_2 | normal:/path/to/sample_1.normal.bam |
230+
231+
Each patient can have any number of samples. Any sample can have any number of BAM files; annotations from the
232+
different BAM files of the same sample will be provided with suffixes _1, _2, etc.
233+
The aggregated VAFator annotations on each sample will also be provided without a suffix.
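
The following Python sketch shows how a table in this format could be grouped by patient and turned into VAFator `--bam` arguments, mirroring the `groupTuple()` call and the `bams_param` expression in the VAFATOR process. The function names are hypothetical, and the sketch assumes the two-column layout described above, with `sample_name:bam` in the second column.

```python
from collections import defaultdict


def group_bams(table_text):
    """Group the 'sample_name:bam' entries of the BAM table by patient name."""
    bams_by_patient = defaultdict(list)
    for line in table_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        patient, sample_bam = line.split("\t")
        bams_by_patient[patient].append(sample_bam)
    return dict(bams_by_patient)


def vafator_bam_args(sample_bams):
    """Turn each 'sample_name:bam' entry into '--bam sample_name bam'."""
    return " ".join("--bam " + " ".join(entry.split(":")) for entry in sample_bams)


table = (
    "patient_1\tprimary_tumor:/path/to/sample_1.primary.bam\n"
    "patient_1\tnormal:/path/to/sample_1.normal.bam\n"
    "patient_2\tnormal:/path/to/sample_2.normal.bam\n"
)
grouped = group_bams(table)
args = vafator_bam_args(grouped["patient_1"])
```

Each patient ends up with a single list of entries, so a sample with several BAM files simply contributes several `--bam` arguments with the same sample name.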
181234

182-
183235

184236
## References
185237

186238
* Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. Unified Representation of Genetic Variants. Bioinformatics (2015) 31(13): 2202-2204. http://bioinformatics.oxfordjournals.org/content/31/13/2202
* Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics (Oxford, England), 27(21), 2987–2993. 10.1093/bioinformatics/btr509
187239
* Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. Gigascience. 2021 Feb 16;10(2):giab008. doi: 10.1093/gigascience/giab008. PMID: 33590861; PMCID: PMC7931819.
188240
* Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319. 10.1038/nbt.3820
241+
* Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672

bin/filter_passed.sh

Lines changed: 0 additions & 5 deletions
This file was deleted.

bin/normalization.sh

Lines changed: 0 additions & 47 deletions
This file was deleted.

bin/summary.sh

Lines changed: 0 additions & 7 deletions
This file was deleted.

main.nf

Lines changed: 20 additions & 10 deletions
@@ -2,10 +2,11 @@
22

33
nextflow.enable.dsl = 2
44

5-
include { BCFTOOLS_NORM; VT_DECOMPOSE_COMPLEX; REMOVE_DUPLICATES } from './modules/normalization'
6-
include { FILTER_VCF } from './modules/filter'
7-
include { SUMMARY_VCF; SUMMARY_VCF as SUMMARY_VCF_2 } from './modules/summary'
8-
include { VAFATOR; MULTIALLELIC_FILTER } from './modules/vafator'
5+
include { FILTER_VCF } from './modules/01_filter'
6+
include { BCFTOOLS_NORM; VT_DECOMPOSE_COMPLEX; REMOVE_DUPLICATES } from './modules/02_normalization'
7+
include { SUMMARY_VCF; SUMMARY_VCF as SUMMARY_VCF_2 } from './modules/03_summary'
8+
include { VAFATOR; MULTIALLELIC_FILTER } from './modules/04_vafator'
9+
include { VARIANT_ANNOTATION } from './modules/05_variant_annotation'
910

1011
params.help= false
1112
params.input_vcfs = false
@@ -21,14 +22,20 @@ params.vcf_without_ad = false
2122
params.mapping_quality = false
2223
params.base_call_quality = false
2324
params.skip_multiallelic_filter = false
24-
params.prefix = false
25+
params.snpeff_organism = false
26+
params.snpeff_args = ""
27+
params.snpeff_datadir = false
2528

2629

2730
if (params.help) {
2831
log.info params.help_message
2932
exit 0
3033
}
3134

35+
if ( params.snpeff_organism && ! params.snpeff_datadir) {
36+
exit 1, "To run snpEff, please, provide your snpEff data folder with --snpeff_datadir"
37+
}
38+
3239
if (! params.input_vcfs && ! params.input_vcf) {
3340
exit 1, "Neither --input_vcfs or --input_vcf are provided!"
3441
}
@@ -50,8 +57,8 @@ else if (params.input_vcf) {
5057
if (params.input_bams) {
5158
Channel
5259
.fromPath(params.input_bams)
53-
.splitCsv(header: ['name', 'tumor_bams', 'normal_bams'], sep: "\t")
54-
.map{ row-> tuple(row.name, row.tumor_bams, row.normal_bams) }
60+
.splitCsv(header: ['name', 'sample_name', 'bam'], sep: "\t")
61+
.map{ row-> tuple(row.name, row.sample_name, row.bam) }
5562
.set { input_bams }
5663
}
5764

@@ -74,15 +81,18 @@ workflow {
7481

7582
SUMMARY_VCF_2(final_vcfs)
7683

77-
if ( params.input_bams) {
78-
VAFATOR(final_vcfs.join(input_bams))
84+
if ( params.input_bams ) {
85+
VAFATOR(final_vcfs.join(input_bams.groupTuple()))
7986
final_vcfs = VAFATOR.out.annotated_vcf
8087
if ( ! params.skip_multiallelic_filter ) {
8188
final_vcfs = MULTIALLELIC_FILTER(final_vcfs)
8289
final_vcfs = MULTIALLELIC_FILTER.out.filtered_vcf
8390
}
8491
}
8592

86-
final_vcfs.map {it.join("\t")}.collectFile(name: "${params.output}/normalized_vcfs.txt", newLine: true)
93+
if (params.snpeff_organism) {
94+
VARIANT_ANNOTATION(final_vcfs)
95+
final_vcfs = VARIANT_ANNOTATION.out.annotated_vcf
96+
}
8797
}
8898

File renamed without changes.
Lines changed: 10 additions & 12 deletions
@@ -4,33 +4,31 @@ params.output = ""
44
params.mapping_quality = false
55
params.base_call_quality = false
66
params.skip_multiallelic_filter = false
7-
params.prefix = false
7+
params.enable_conda = false
88

99

1010
process VAFATOR {
1111
cpus params.cpus
1212
memory params.memory
13-
tag "${name}"
14-
publishDir "${params.output}/${name}", mode: "copy"
13+
tag "${patient_name}"
14+
publishDir "${params.output}/${patient_name}", mode: "copy"
1515

16-
conda (params.enable_conda ? "bioconda::vafator=0.4.0" : null)
16+
conda (params.enable_conda ? "bioconda::vafator=1.1.2" : null)
1717

1818
input:
19-
tuple val(name), file(vcf), val(normal_bams), val(tumor_bams)
19+
tuple val(patient_name), file(vcf), val(bams)
2020

2121
output:
22-
tuple val(name), file("${vcf.baseName}.vaf.vcf"), emit: annotated_vcf
22+
tuple val(patient_name), file("${vcf.baseName}.vaf.vcf"), emit: annotated_vcf
2323

2424
script:
25-
normal_bams_param = normal_bams?.trim() ? "--normal-bams " + normal_bams.split(",").join(" ") : ""
26-
tumor_bams_param = tumor_bams?.trim() ? "--tumor-bams " + tumor_bams.split(",").join(" ") : ""
25+
bams_param = bams.collect { b -> "--bam " + b.split(":").join(" ") }.join(" ")
2726
mq_param = params.mapping_quality ? "--mapping-quality " + params.mapping_quality : ""
2827
bq_param = params.base_call_quality ? "--base-call-quality " + params.base_call_quality : ""
29-
prefix_param = params.prefix ? "--prefix " + params.prefix : ""
3028
"""
3129
vafator \
3230
--input-vcf ${vcf} \
33-
--output-vcf ${vcf.baseName}.vaf.vcf ${normal_bams_param} ${tumor_bams_param} ${mq_param} ${bq_param} ${prefix_param}
31+
--output-vcf ${vcf.baseName}.vaf.vcf ${bams_param} ${mq_param} ${bq_param}
3432
"""
3533
}
3634

@@ -41,13 +39,13 @@ process MULTIALLELIC_FILTER {
4139
tag "${name}"
4240
publishDir "${params.output}/${name}", mode: "copy"
4341

44-
conda (params.enable_conda ? "bioconda::vafator=0.4.0" : null)
42+
conda (params.enable_conda ? "bioconda::vafator=1.1.2" : null)
4543

4644
input:
4745
tuple val(name), file(vcf)
4846

4947
output:
50-
tuple val(name), file("${vcf.baseName}.filtered_multiallelics.vcf"), emit: filtered_vcf
48+
tuple file("${vcf.baseName}.filtered_multiallelics.vcf"), emit: filtered_vcf
5149

5250
script:
5351
"""

modules/05_variant_annotation.nf

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
params.memory = "3g"
2+
params.cpus = 1
3+
params.output = "."
4+
params.snpeff_datadir = false
5+
params.snpeff_organism = false
6+
params.snpeff_args = ""
7+
8+
9+
process VARIANT_ANNOTATION {
10+
cpus params.cpus
11+
memory params.memory
12+
publishDir "${params.output}/${name}", mode: "copy"
13+
14+
conda (params.enable_conda ? "bioconda::snpeff=5.0" : null)
15+
16+
input:
17+
tuple val(name), file(vcf)
18+
19+
output:
20+
tuple val(name), file("${name}.annotated.vcf") , emit: annotated_vcf
21+
22+
script:
23+
datadir_arg = params.snpeff_datadir ? "-dataDir ${params.snpeff_datadir}" : ""
24+
"""
25+
snpEff eff ${datadir_arg} ${params.snpeff_args} -nodownload ${params.snpeff_organism} ${vcf} > ${name}.annotated.vcf
26+
"""
27+
}
