NF-ENCODE-ATAC

This is a Nextflow-based pipeline for processing ATAC-seq data, modeled on ENCODE's ATAC-seq pipeline. It attempts to replicate the commands that ENCODE would normally run, but in a Nextflow-native format.

See the Pipeline Details section below for more details.

Citation / Credits

Please be sure you cite ENCODE's ATAC-seq pipeline if you use this pipeline:

ENCODE-DCC/atac-seq-pipeline:2.0.0

Additionally, a select set of nf-core modules are used in this pipeline. Please be sure to cite these as well:

nf-core

Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

Nextflow

Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

Quick Start

Prerequisites

To use this workflow, you MUST have:

Nextflow

Additionally, it is HIGHLY recommended that you have one of:

Installation

The pipeline can be installed using the following command:

nextflow pull WeirauchLab/nf-encode-atac

If that fails, you can manually install the pipeline by cloning the repository:

git clone https://github.com/WeirauchLab/nf-encode-atac.git

Prepare samplesheet

The pipeline requires a samplesheet in CSV format. The samplesheet should contain a combination of the following columns:

Column	Required	Description
id	Yes	The sample ID.
group	Yes	A group name. Samples that share a group name are treated as replicates.
control_sample_id	No	The ID of the control sample.
control_group_id	No	The ID of the control group.
fastq_1	Yes	The path to the first FASTQ file.
fastq_2	No	The path to the second FASTQ file.
adapter_1	No	Adapter sequence to trim from read 1. Detected automatically if not supplied.
adapter_2	No	Adapter sequence to trim from read 2. Detected automatically if not supplied.

An example samplesheet is shown below:

id,group,fastq_1,fastq_2
example1,example,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
example2,example,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz
example2,example,/path/to/example2_R1_2.fastq.gz,/path/to/example2_R2_2.fastq.gz

If the FASTQ files for a sample are split across multiple files, you can specify multiple rows with the same sample ID. The files will be merged together (see "example2" in the example samplesheet above).
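The row-merging behavior can be pictured with a short Python sketch. It is only an illustration of grouping rows by sample ID, not the pipeline's own code; the function name `group_fastqs_by_sample` is hypothetical, and the samplesheet text is the example shown above.

```python
import csv
import io
from collections import defaultdict

# The example samplesheet from this README, inlined so the sketch is self-contained.
SAMPLESHEET = """\
id,group,fastq_1,fastq_2
example1,example,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
example2,example,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz
example2,example,/path/to/example2_R1_2.fastq.gz,/path/to/example2_R2_2.fastq.gz
"""


def group_fastqs_by_sample(handle):
    """Collect every FASTQ listed for the same sample ID (hypothetical helper)."""
    merged = defaultdict(lambda: {"group": None, "fastq_1": [], "fastq_2": []})
    for row in csv.DictReader(handle):
        sample = merged[row["id"]]
        sample["group"] = row["group"]
        sample["fastq_1"].append(row["fastq_1"])
        if row.get("fastq_2"):  # fastq_2 is optional in the samplesheet
            sample["fastq_2"].append(row["fastq_2"])
    return dict(merged)


samples = group_fastqs_by_sample(io.StringIO(SAMPLESHEET))
for sample_id, info in samples.items():
    print(sample_id, info["group"], len(info["fastq_1"]), "read 1 file(s)")
```

Here "example2" ends up with two read 1 files and two read 2 files, which is the set that would be merged for that sample.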

A note about adapters:

If you do not specify an adapter, the pipeline lets fastp attempt to detect the adapter sequence automatically. If you do specify an adapter in the samplesheet, that sequence is used for trimming. Two pipeline parameters, adapter_1 and adapter_2, can also be used to set adapter sequences globally. When they are set, any sample without an adapter in the samplesheet uses these values.
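The precedence described above (samplesheet value, then global parameter, then automatic detection) can be sketched in Python as follows. This is an illustration of the rule only; `resolve_adapter` is a hypothetical name, and the sequences are placeholder values.

```python
def resolve_adapter(samplesheet_value, global_value):
    """Pick the adapter for one read, or None to leave detection to fastp.

    Hypothetical helper: a samplesheet entry wins, then the global
    adapter_1 / adapter_2 parameter, otherwise nothing is specified.
    """
    if samplesheet_value:
        return samplesheet_value
    if global_value:
        return global_value
    return None


# Placeholder sequences, used only to show the three cases.
print(resolve_adapter("CTGTCTCTTATACACATCT", "AGATCGGAAGAGC"))  # samplesheet value is used
print(resolve_adapter("", "AGATCGGAAGAGC"))                     # falls back to the global parameter
print(resolve_adapter("", None))                                # None: automatic detection
```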

About control groups

Control samples are typically IgG or input samples. The pipeline can assign controls on a per-sample or per-group basis, using either the control_sample_id or the control_group_id column.

Specify a sample ID that is a paired control

id,control_sample_id,group,fastq_1,fastq_2
target1,input1,example,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
target2,input2,example,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
input1,,example,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz
input2,,example,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz

If a control_sample_id is specified, the control_group_id column will be filled in during the pipeline.

Specify a group name that is a control group

id,control_group_id,group,fastq_1,fastq_2
target1,input,target,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
target2,input,target,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
input1,,input,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz
input2,,input,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz

With this approach, the pooled control group is used for all samples in the group.
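As a rough sketch, the two ways of naming a control could be resolved like this in Python. It assumes that a control_sample_id is "filled in" by looking up the group of the named control sample; that detail, and the function name `resolve_controls`, are assumptions rather than the pipeline's documented behavior.

```python
import csv
import io

# The per-group example from this README.
PER_GROUP = """\
id,control_group_id,group,fastq_1,fastq_2
target1,input,target,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
target2,input,target,/path/to/example1_R1.fastq.gz,/path/to/example1_R2.fastq.gz
input1,,input,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz
input2,,input,/path/to/example2_R1_1.fastq.gz,/path/to/example2_R2_1.fastq.gz
"""


def resolve_controls(handle):
    """Map each sample ID to its control sample and control group (hypothetical)."""
    rows = list(csv.DictReader(handle))
    group_of = {row["id"]: row["group"] for row in rows}
    known_groups = set(group_of.values())
    resolved = {}
    for row in rows:
        control_sample = row.get("control_sample_id") or None
        control_group = row.get("control_group_id") or None
        if control_sample is not None:
            if control_sample not in group_of:
                raise ValueError(f"unknown control sample: {control_sample}")
            # Assumption: the control group is taken from the control sample's group.
            control_group = group_of[control_sample]
        elif control_group is not None and control_group not in known_groups:
            raise ValueError(f"unknown control group: {control_group}")
        resolved[row["id"]] = {
            "control_sample_id": control_sample,
            "control_group_id": control_group,
        }
    return resolved


controls = resolve_controls(io.StringIO(PER_GROUP))
for sample_id, control in controls.items():
    print(sample_id, control)
```

In this example both target samples resolve to the "input" group, while the input samples themselves have no control.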

Locate reference genome files

At minimum, you need the following:

reference genome in fasta format

It is recommended that you also have:

GTF annotation file
region exclusion bed file (if applicable)

Create a parameters file

This is technically optional, but it is highly recommended. A parameters file can be in either JSON or YAML format. This file should contain the settings for the pipeline in a key-value format. All available parameters can be found in the nextflow.config file and additional validation information can be found in nextflow_schema.json.

A basic example may look like:

{
  "input": "samplesheet.csv",
  "outdir": "results",
  "fasta": "/path/to/genome.fa",
  "gtf": "/path/to/annotation.gtf"
}
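If you prefer to generate the parameters file from a script, for instance when launching many runs, a minimal Python sketch might look like the following. The helper name `write_params` is hypothetical, the keys are the ones from the example above, and the file is written to a temporary directory only to keep the sketch self-contained.

```python
import json
import tempfile
from pathlib import Path


def write_params(path, settings):
    """Write settings as a JSON parameters file and return its path (hypothetical helper)."""
    path = Path(path)
    path.write_text(json.dumps(settings, indent=2) + "\n")
    return path


# Same keys as the basic example in this README.
settings = {
    "input": "samplesheet.csv",
    "outdir": "results",
    "fasta": "/path/to/genome.fa",
    "gtf": "/path/to/annotation.gtf",
}

with tempfile.TemporaryDirectory() as tmp:
    params_file = write_params(Path(tmp) / "params.json", settings)
    loaded = json.loads(params_file.read_text())  # read back to confirm valid JSON
    print(sorted(loaded))
```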

Run the pipeline

To run the pipeline, you can use the following command:

# Basic command
nextflow run WeirauchLab/nf-encode-atac -params-file params.json

# Use -profile to select an execution profile. Here, docker is used.
nextflow run WeirauchLab/nf-encode-atac -profile docker -params-file params.json

If all goes well, you should see the pipeline start processing your data. A run can also be resumed if needed by adding the -resume flag. Resuming requires that the work directory (workDir) from the earlier run is still present.
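When the pipeline is launched from another script, the command line can be assembled programmatically. The Python sketch below only builds and prints the command from the options mentioned in this section (-profile, -params-file, -resume); `build_command` is a hypothetical helper and nothing is executed.

```python
import shlex


def build_command(params_file, profile=None, resume=False):
    """Assemble the nextflow command as an argument list (hypothetical helper)."""
    command = ["nextflow", "run", "WeirauchLab/nf-encode-atac"]
    if profile:
        command += ["-profile", profile]
    command += ["-params-file", params_file]
    if resume:
        command.append("-resume")
    return command


command = build_command("params.json", profile="docker", resume=True)
print(shlex.join(command))
# To actually launch the run, the list could be passed to subprocess.run(command, check=True).
```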

Pipeline Details

From ENCODE's WDL to Nextflow

This pipeline is specifically based on ENCODE-DCC/atac-seq-pipeline:2.0.0. The pipeline is designed to replicate the commands that would normally be processed by ENCODE, but in a Nextflow-native format. This was done by looking through the repository, dissecting the commands, and converting them to Nextflow processes. Several steps were validated by looking at the scripts run by Cromwell.

There are a few minor differences:

SPP peak calling is not included.
SPP's fragment estimation is performed on the full library, not just a subsampled tagAlign.
SPP is not run using a re-aligned R1 file.
Summits are not called by MACS2 (at least by default).
- ENCODE calls summits, but we don't typically use them.
Tool versions are not identical.

Additionally, there are a few "silent" differences:

Pseudoreplicate generation was performed through a series of bash commands. This is now done through a Python script.
The method for determining conservative / optimal peak sets was coded in a new Python script.
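To give a sense of what pseudoreplicate generation involves, here is a minimal Python sketch of the general idea: shuffle one replicate's reads with a fixed seed and split them into two halves. It is not the script used by this pipeline; the function name, the seed, and the treatment of each list element as one read (or one read pair) are assumptions.

```python
import random


def split_pseudoreplicates(reads, seed=0):
    """Shuffle reads reproducibly and split them into two pseudoreplicates.

    Hypothetical sketch: each element stands for one read record
    (for paired-end data, one element per read pair so mates stay together).
    """
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    half = (len(shuffled) + 1) // 2        # the first half gets the extra read if odd
    return shuffled[:half], shuffled[half:]


reads = [f"read_{i}" for i in range(11)]
pr1, pr2 = split_pseudoreplicates(reads)
print(len(pr1), len(pr2))
```

Every read lands in exactly one of the two pseudoreplicates, and rerunning with the same seed gives the same split.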

Additional Features

Part of the reason for converting this pipeline was to add additional features that were not present in the original pipeline. These features include:

Support for building genome indices "on the fly"
- There is no need to have a separate genome.tsv file anymore. Just supply what you have with the proper parameters. If a genome index isn't provided, it is built at the start of the run.
Multiple conditions can be run at once.
Additional bigwig normalized signal tracks are generated with deepTools.
Trackhubs for UCSC Genome Browser can be generated.
QC reporting is now done with MultiQC.
A metagenomics section was added to classify reads.

FAQ

Where can I find more information?

Check the documentation folder! This contains:

rehash of the quickstart
description of outputs
comparison of commands between ENCODE and Nextflow
