This is a Nextflow-based pipeline for processing ATAC-seq data, modeled on ENCODE's ATAC-seq pipeline. It attempts to replicate the commands that ENCODE would normally run, but in a Nextflow-native format.
Please see the later section for more details.
Please be sure you cite ENCODE's ATAC-seq pipeline if you use this pipeline:
Additionally, a select set of nf-core modules are used in this pipeline. Please be sure to cite these as well:
Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.
Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.
To use this workflow, you MUST have:
Additionally, it is HIGHLY recommended that you have one of:
The pipeline can be installed using the following command:
nextflow pull WeirauchLab/nf-encode-atac
If that fails, you can manually install the pipeline by cloning the repository:
git clone WeirauchLab/nf-encode-atac
The pipeline requires a samplesheet in CSV format. The samplesheet should contain a combination of the following columns:
| Column | Required | Default | Description |
|---|---|---|---|
| id | Yes | | The sample ID |
| group | Yes | | A group name. Samples with the same group name are treated as replicates |
| control_sample_id | No | | The ID of the control sample |
| control_group_id | No | | The ID of the control group |
| fastq_1 | Yes | | The path to the first fastq file |
| fastq_2 | No | | The path to the second fastq file |
| adapter_1 | No | | Adapter sequence to trim for read 1. Detected automatically if not supplied |
| adapter_2 | No | | Adapter sequence to trim for read 2. Detected automatically if not supplied |
An example samplesheet is shown below:
id,group,fastq_1,fastq_2
example1,example,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
example2,example,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz
example2,example,/path/to/example2_R1_2.fastq.gz,/path/to/example2_R2_2.fastq.gz
If the fastq files for a sample are split across multiple files, you can specify multiple rows for the same sample ID. They will be merged together (see "example2" in the csv table above).
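To preview how rows sharing a sample ID will be grouped before launching a run, a small check along these lines may help. It is only a sketch: `group_fastqs` and `REQUIRED_COLUMNS` are hypothetical names, not part of the pipeline, and the pipeline's own samplesheet validation may differ.

```python
import csv
import io
from collections import defaultdict

# Columns marked "Yes" under "Required" in the samplesheet table.
REQUIRED_COLUMNS = {"id", "group", "fastq_1"}


def group_fastqs(samplesheet_text):
    """Return {sample id: [(fastq_1, fastq_2 or None), ...]} in samplesheet order."""
    reader = csv.DictReader(io.StringIO(samplesheet_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"samplesheet is missing required columns: {sorted(missing)}")
    runs = defaultdict(list)
    for row in reader:
        # An empty or absent fastq_2 means the run is single-end.
        runs[row["id"]].append((row["fastq_1"], row.get("fastq_2") or None))
    return dict(runs)


EXAMPLE = """\
id,group,fastq_1,fastq_2
example1,example,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
example2,example,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz
example2,example,/path/to/example2_R1_2.fastq.gz,/path/to/example2_R2_2.fastq.gz
"""

for sample_id, fastqs in group_fastqs(EXAMPLE).items():
    print(sample_id, "has", len(fastqs), "fastq row(s) to merge")
```

Here "example2" ends up with two rows, which is the case the pipeline merges into a single sample.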
A note about adapters:
If you do not specify an adapter, the pipeline will let fastp attempt to automatically detect the adapter sequence.
If you do specify an adapter, the pipeline will use that adapter sequence for trimming.
Two pipeline parameters, adapter_1 and adapter_2, can also be used to specify adapter sequences globally.
If these are set, they are used for any sample whose adapter is NOT specified in the samplesheet.
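The precedence described in this note can be summarized in a few lines. This is a sketch of the rule as stated here, not the pipeline's actual code; `resolve_adapter` is a hypothetical helper, and `None` is used to stand for "let fastp auto-detect".

```python
def resolve_adapter(samplesheet_value, global_value=None):
    """Pick the adapter for one read of one sample.

    Order of precedence, as described in the note above:
    1. the adapter given in the samplesheet row,
    2. the global adapter_1 / adapter_2 parameter,
    3. None, meaning automatic detection is left to fastp.
    """
    for candidate in (samplesheet_value, global_value):
        if candidate and candidate.strip():
            return candidate.strip()
    return None


# Samplesheet value wins over the global parameter.
print(resolve_adapter("CTGTCTCTTATACACATCT", "AGATCGGAAGAGC"))
# Empty samplesheet cell falls back to the global parameter.
print(resolve_adapter("", "AGATCGGAAGAGC"))
# Nothing supplied anywhere: auto-detection.
print(resolve_adapter("", None))
```

The adapter strings in the example are placeholders for illustration; use whatever sequences match your library preparation.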
Control samples are typically IgG or input samples. This pipeline can handle this on a per-sample or per-group basis.
To implement this, you can specify either the control_sample_id or control_group_id column.
- Specify a sample ID that is a paired control
id,control_sample_id,group,fastq_1,fastq_2
target1,input1,example,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
target2,input2,example,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
input1,,example,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz
input2,,example,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz
If a control_sample_id is specified, the control_group_id column will be filled in during the pipeline.
- Specify a group name that is a control group
id,control_group_id,group,fastq_1,fastq_2
target1,input,target,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
target2,input,target,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
input1,,input,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz
input2,,input,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz
When doing it this way, the pooled control group will be used for all samples in the group.
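The two ways of assigning controls can be sketched as a lookup over the samplesheet rows. This is an illustration only: `resolve_controls` is a hypothetical function, and the assumption that a missing control_group_id is filled in with the group of the named control sample is a reading of the text above, not confirmed behavior.

```python
def resolve_controls(rows):
    """Map each sample ID to its control group and the control sample IDs it uses."""
    group_of = {row["id"]: row["group"] for row in rows}
    members = {}
    for row in rows:
        ids = members.setdefault(row["group"], [])
        if row["id"] not in ids:  # a sample may span several rows
            ids.append(row["id"])

    resolved = {}
    for row in rows:
        sample_ctrl = row.get("control_sample_id") or None
        group_ctrl = row.get("control_group_id") or None
        if sample_ctrl:
            if sample_ctrl not in group_of:
                raise ValueError(f"unknown control sample: {sample_ctrl}")
            controls = [sample_ctrl]
            # Assumption: the control group is taken from the control sample.
            group_ctrl = group_ctrl or group_of[sample_ctrl]
        elif group_ctrl:
            if group_ctrl not in members:
                raise ValueError(f"unknown control group: {group_ctrl}")
            controls = list(members[group_ctrl])  # pooled control group
        else:
            controls = []
        resolved[row["id"]] = {"control_group_id": group_ctrl, "controls": controls}
    return resolved


rows = [
    {"id": "target1", "control_group_id": "input", "group": "target"},
    {"id": "target2", "control_group_id": "input", "group": "target"},
    {"id": "input1", "control_group_id": "", "group": "input"},
    {"id": "input2", "control_group_id": "", "group": "input"},
]
for sample_id, info in resolve_controls(rows).items():
    print(sample_id, info)
```

With the group-based samplesheet, both targets resolve to the same pooled pair of inputs, while the inputs themselves have no control.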
At minimum, you need the following:
- reference genome in fasta format
It is recommended that you also have:
- GTF annotation file
- region exclusion bed file (if applicable)
A parameters file is technically optional, but it is highly recommended.
It can be in either JSON or YAML format.
This file should contain the settings for the pipeline in a key-value format.
All available parameters can be found in the nextflow.config file and additional
validation information can be found in nextflow_schema.json.
A basic example may look like:
{
"input": "samplesheet.csv",
"outdir": "results",
"fasta": "/path/to/genome.fa",
"gtf": "/path/to/annotation.gtf"
}
To run the pipeline, you can use the following command:
# Basic command
nextflow run WeirauchLab/nf-encode-atac -params-file params.json
# Use profiles to specify execution profiles. Here, docker is used.
nextflow run WeirauchLab/nf-encode-atac -profile docker -params-file params.json
If all goes well, you should see the pipeline start processing your data.
Pipelines can also be resumed if needed by adding the -resume flag. This will
require that the workDir directory is still present.
This pipeline is specifically based on ENCODE-DCC/atac-seq-pipeline:2.0.0. The pipeline is designed to replicate the commands that would normally be processed by ENCODE, but in a Nextflow-native format. This was done by looking through the repository, dissecting the commands, and converting them to Nextflow processes. Several steps were validated by looking at the scripts run by Cromwell.
There are a few minor differences:
- SPP peak calling is not included.
- SPP's fragment estimation is performed on the full library, not just a subsampled tagAlign.
- SPP is not run using a re-aligned R1 file.
- Summits are not called by MACS2 (by default, at least).
  - ENCODE calls them, but we don't typically use them.
- Tool versions are not identical.
Additionally, there are a few "silent" differences:
- In ENCODE, pseudoreplicates are generated through a series of bash commands. Here, this is done with a Python script.
- The method for determining conservative / optimal peak sets is implemented in a new Python script.
Part of the reason for converting this pipeline was to add additional features that were not present in the original pipeline. These features include:
- Support for building genome indices "on the fly"
  - There is no need to have a separate genome.tsv file anymore. Just supply what you have with the proper parameters. If a genome index isn't provided, it is built at the start of the run.
- Multiple conditions can be run at once.
- Additional bigwig normalized signal tracks are generated with deepTools.
- Trackhubs for UCSC Genome Browser can be generated.
- QC reporting is now done with MultiQC.
- Metagenomics section added to classify reads.
Check the documentation folder! This contains:
- rehash of the quickstart
- description of outputs
- comparison of commands between ENCODE and Nextflow