This is the Nextflow pipeline to demultiplex PacBio HiFi data for the seqWell LongPlex Long Fragment Multiplexing Kit. The pipeline uses Lima for demultiplexing and uses longplexpy tools for data filtering. The pipeline is as shown in the image below. The pipeline starts with HiFi BAM files and has the following steps:
- The first Lima process, `LIMA_BOTH_END`, demultiplexes reads using lima's neighbor option. This setting will demultiplex reads with both an i7 and i5 seqWell barcode sequence.
- The `LIST_HYBRIDS` and `REMOVE_HYBRIDS` processes identify and remove any reads with mismatched i7 and i5 seqWell barcode sequences in the remaining non-demultiplexed reads.
- The second Lima process, `LIMA_EITHER_END`, demultiplexes reads with only an i7 or i5 seqWell barcode sequence.
- The BAM files for each sample within each pool are merged in the `MERGE_READS` process, and merged FASTQ and BAM files are created.
- The `DEMUX_STATS` process generates a summary of the demultiplexing steps.
- If a `rename_map` is provided, the `RENAME_DEMUX_STATS` process renames the sample identifiers in the demultiplexing summary to match the user-defined sample names.
- `NANOSTAT` and `MULTIQC` are used to generate summary metrics for the reads assigned to each sample in the pool.
- `NANOSTAT_UNBARCODED` generates sequencing metrics for the unbarcoded reads remaining after both lima steps. Because the unbarcoded BAM is unaligned, reads are first converted to FASTQ via pysam before being passed to NanoStat.
- `DEMUX_QC` combines lima barcode statistics, per-sample NanoStat results, and unbarcoded NanoStat results into two final output tables per pool: a per-well stats table and a per-pool summary table.
The final output from this pipeline includes Lima output files, demultiplexed BAM and FASTQ files, a demultiplexing summary, a MultiQC report collating NanoStat results, and comprehensive per-pool and per-well demux statistics.
This pipeline requires installation of Nextflow. It also requires installation of either a containerization platform such as Docker or a package manager such as conda/mamba.
All docker containers used in this pipeline are publicly available.
- lima: quay.io/biocontainers/lima:2.13.0--h9ee0642_0
- samtools: quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1
- longplexpy: seqwell/longplexpy:0.2.1
- picard: quay.io/biocontainers/picard:3.2.0--hdfd78af_0
- R: rocker/verse:4.3.1
- nanostat: quay.io/biocontainers/nanostat:1.6.0--pyhdfd78af_0
- multiqc: quay.io/biocontainers/multiqc:1.21--pyhdfd78af_0
- python: python:3.12-bookworm
- pandas: quay.io/biocontainers/pandas:1.5.2
The conda environment is defined in environment-pipeline.yml and will be built automatically if the pipeline is run with -profile conda. Note that this profile is only supported on Linux systems, as lima (v2.13.0) is only available for Linux.
The required parameters are pool_sheet and output.
pool_sheet is the path to a CSV file.
There are four required columns:
- pool_ID: Identifier to be used in naming output files. The `pool_ID` must contain only letters and numbers. Please avoid underscore (`_`), dash (`-`), and dot (`.`) characters in the `pool_ID`.
- pool_path: Path to PacBio HiFi BAM file for this pool. This path can be a local absolute path or an AWS S3 URI. If it is an AWS S3 URI, please make sure to set your security credentials appropriately.
- i7_barcode and i5_barcode: Path to the appropriate barcodes in FASTA format. Default barcodes are found in `barcodes/`. For early access users, please use the barcode set labelled `set3`. Please use the barcode set labelled `set1` if you bought kits after product launch.
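A quick pre-flight check can catch `pool_ID` naming problems before a run. The sketch below is not part of the pipeline: the function names `is_valid_pool_id` and `invalid_pool_ids` are hypothetical, and the file check assumes a simple CSV with a header row and no quoted fields or embedded commas. It only tests the letters-and-numbers rule for `pool_ID`.

```shell
#!/bin/sh
# Illustrative pre-flight check; NOT part of the pipeline.
# Both function names are hypothetical.

# Succeeds only if the argument is non-empty and contains
# nothing but ASCII letters and digits.
is_valid_pool_id() {
    case "$1" in
        '' | *[!A-Za-z0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Prints every pool_ID in a pool sheet that breaks the rule.
# Assumption: the header row names the column "pool_ID" and the
# file is plain CSV without quoted fields.
invalid_pool_ids() {
    awk -F',' '
        { sub(/\r$/, "") }    # tolerate Windows line endings
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "pool_ID") col = i; next }
        col && $col !~ /^[A-Za-z0-9]+$/ { print $col }
    ' "$1"
}

is_valid_pool_id "bc1015"  && echo "bc1015: ok"
is_valid_pool_id "bc_1015" || echo "bc_1015: invalid"
```

Running `invalid_pool_ids path/to/pool_sheet.csv` would list any offending identifiers, one per line, and print nothing when all are acceptable.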
The output directory path can be a local absolute path or an AWS S3 URI. If it is an AWS S3 URI, please make sure to set your security credentials appropriately.
rename_map is the path to a CSV file used to rename output BAM and FASTQ files, as well as the sample identifiers in the demultiplexing summary and the DEMUX_QC output tables.
If not provided, output files and all summary tables will use pool_ID.well_ID as the default sample identifier.
There are two required columns:
- pool_ID.well_ID: The default sample identifier in the format `pool_ID.well_ID` (e.g. `bc1015.A01`). The formatting is strict: the pool ID and well ID must be joined with a `.` (not `_` or any other character). The well ID must be a letter `A`–`H` followed by a two-digit number (e.g. `A01`, `B12`); single-digit numbers must be zero-padded (e.g. `A1` is invalid; use `A01`).
- sample_ID: The desired output sample name (e.g. `bc1015.sample1`). Unlike `pool_ID.well_ID`, underscores (`_`) are accepted as connectors within the sample name (e.g. `bc1015_sample1` is also valid).
Example (tests/sample_map.csv):
| pool_ID.well_ID | sample_ID |
|---|---|
| bc1015.A01 | bc1015.sample1 |
| bc1015.A02 | bc1015.sample2 |
| bc1015.A03 | bc1015.sample3 |
| bc1015.B01 | bc1015.sample4 |
| bc1015.B02 | bc1015.sample5 |
| bc1015.B03 | bc1015.sample6 |
| bc1015.C01 | bc1015.sample7 |
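The strict `pool_ID.well_ID` key format can be checked with a single regular expression. This is a minimal sketch rather than the pipeline's own validation: `is_valid_map_key` is a hypothetical helper, and it only checks the shape of the key (alphanumeric pool ID, a dot, a letter `A`–`H`, two digits), not whether the two-digit number corresponds to a real plate position.

```shell
#!/bin/sh
# Illustrative check; NOT part of the pipeline.
# is_valid_map_key is a hypothetical helper name.

# Succeeds if the key looks like <alphanumeric pool ID>.<A-H><two digits>.
# LC_ALL=C keeps the [A-H] range limited to uppercase ASCII letters.
is_valid_map_key() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[A-Za-z0-9]+\.[A-H][0-9][0-9]$'
}

for key in bc1015.A01 bc1015.A1 bc1015_A01; do
    if is_valid_map_key "$key"; then
        echo "$key: valid"
    else
        echo "$key: invalid"
    fi
done
```

Only the first of the three example keys passes; the second is not zero-padded and the third uses `_` instead of `.` as the connector.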
When rename_map is provided:
- The `RENAME_DEMUX_STATS` process produces a renamed version of the demultiplexing summary CSV with the user-defined sample names applied.
- The `DEMUX_QC` process uses the map to populate the `Sample_Name` column in the per-well stats table. The `Barcode` column always retains the original `pool_ID.well_ID` key (e.g. `bc1015.A01`) regardless of renaming.
- When multiple pools are present in the `pool_sheet`, the `rename_map` may contain entries for all pools. Each pool's `DEMUX_QC` run will automatically filter the map to only its own entries using the `pool_ID` prefix, ensuring no cross-pool mixing.
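To illustrate the idea of prefix-based filtering, the sketch below keeps only the rows of a rename map that belong to one pool. It is an assumption about how such a filter could work, not the `DEMUX_QC` implementation: `filter_map_for_pool` is a hypothetical helper, and it assumes `pool_ID.well_ID` is the first column of a simple CSV. Matching on the pool ID plus the trailing `.` is what prevents a pool such as `bc1015` from picking up rows for `bc10155`.

```shell
#!/bin/sh
# Illustrative filter; NOT the DEMUX_QC implementation.
# filter_map_for_pool is a hypothetical helper name.

# Usage: filter_map_for_pool <pool_ID> <rename_map.csv>
# Prints the header row plus every row whose first column starts with
# "<pool_ID>." (assumes pool_ID.well_ID is the first column).
filter_map_for_pool() {
    awk -F',' -v prefix="$1." 'NR == 1 || index($1, prefix) == 1' "$2"
}

# Example call (commented out so the block has no file dependency):
# filter_map_for_pool bc1015 tests/sample_map.csv
```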
Several profiles are available and can be selected with the -profile option at the command line.
- apptainer
- aws
- conda
- docker
- singularity
A minimal execution might look like:

```shell
nextflow run \
    -profile docker \
    main.nf \
    --pool_sheet "${PWD}/path/to/pool_sheet.csv" \
    --output "${PWD}/path/to/output"
```

The pipeline can be run using included test data without BAM and FASTQ file renaming:

```shell
nextflow run \
    -profile docker \
    main.nf \
    -c nextflow.config \
    --pool_sheet "${PWD}/tests/pool_sheet.csv" \
    --output "${PWD}/test_output" \
    -with-report \
    -with-trace \
    -resume -bg
```

The pipeline can be run using included test data with BAM and FASTQ file renaming:

```shell
nextflow run \
    -profile docker \
    main.nf \
    -c nextflow.config \
    --pool_sheet "${PWD}/tests/pool_sheet.csv" \
    --output "${PWD}/test_output_renamed" \
    --rename_map "${PWD}/tests/sample_map.csv" \
    -with-report \
    -with-trace \
    -resume -bg
```

The pipeline can also be run on the included test data using the conda profile:

```shell
nextflow run \
    -profile conda \
    main.nf \
    -c nextflow.config \
    --pool_sheet "${PWD}/tests/pool_sheet.csv" \
    --output "${PWD}/test_output" \
    -with-report \
    -with-trace \
    -resume -bg
```

An example output directory structure:

```
test_output/
├── bc1015/
│   ├── demux_summary/
│   │   ├── bc1015_demux_report.csv  # Summary of demultiplexing results
│   │   └── bc1015_demux_report_renamed.csv  # Renamed summary (only present if --rename_map is provided)
│   ├── hybrids/
│   │   ├── bc1015.hybrid_list.txt  # List of reads with mismatched i5 & i7 barcode sequences
│   │   └── bc1015.unbarcoded.filtered.bam  # Reads that did not demultiplex in step LIMA_BOTH_ENDS with hybrid reads removed
│   ├── lima_out/
│   │   ├── demux_either_i7_i5/  # Demultiplexing results using a single barcode
│   │   │   ├── bc1015.[BARCODE_ID]--[BARCODE_ID].bam  # Reads demultiplexed based on a single barcode
│   │   │   ├── ...
│   │   │   ├── bc1015.unbarcoded.bam  # Reads that failed to demultiplex
│   │   │   ├── i7_5_bc1015.lima.counts  # Counts of each observed barcode
│   │   │   └── i7_5_bc1015.lima.summary  # Summary of lima read filtering results
│   │   └── demux_i7_i5/  # Demultiplexing results using i5 and i7 sequences
│   │       ├── bc1015.lima.report  # lima findings for every read
│   │       ├── bc1015.[P5_BARCODE_ID]--[P7_BARCODE_ID].bam  # Reads demultiplexed based on matching i5 and i7 sequences
│   │       ├── ...
│   │       ├── bc1015.unbarcoded.bam  # Reads that did not demultiplex in the first lima process
│   │       ├── i7_i5_bc1015.lima.counts  # Counts of each observed barcode
│   │       └── i7_i5_bc1015.lima.summary  # Summary of lima read filtering results
│   ├── merged_bam/
│   │   ├── bc1015.[BARCODE_WELL/sample_ID].bam  # Merged BAM file for specific barcode well; sample_ID is used if rename_map is provided, otherwise barcode_well is used (e.g. bc1015.A01)
│   │   └── ...
│   ├── merged_fastq/
│   │   ├── bc1015.[BARCODE_WELL/sample_ID].fastq.gz  # Merged FASTQ file for specific barcode well; sample_ID is used if rename_map is provided, otherwise barcode_well is used (e.g. bc1015.A01)
│   │   └── ...
│   ├── demux_qc/
│   │   ├── bc1015_per_barcode_qc_report.csv  # Per-barcode QC report for pool bc1015
│   │   └── bc1015_per_pool_qc_report.csv  # Per-pool QC report for pool bc1015
│   └── multiqc/
│       └── bc1015_multiqc_report.html  # MultiQC report including NanoStat results
└── logs/
    ├── execution_report_[DATE-TIME-STAMP].html  # Nextflow execution report
    ├── execution_timeline_[DATE-TIME-STAMP].html  # Nextflow execution timeline
    ├── execution_trace_[DATE-TIME-STAMP].txt  # Nextflow execution trace
    └── pipeline_dag_[DATE-TIME-STAMP].html  # Nextflow pipeline DAG
```
One row per well.
| Column | Description |
|---|---|
| `Sample_Name` | User-defined sample name from rename_map, or pool_ID.well_ID if not provided |
| `Barcode` | Original pool_ID.well_ID key (e.g. bc1015.A01): always the well identifier regardless of renaming |
| `Barcode_Quality` | Mean ScoreCombined from both .lima.report files across all reads assigned to this well |
| `HiFi_Reads_count` | Total reads assigned to this well: both-end reads plus either-end reads (P5-only and P7-only rows summed per well) |
| `Mean_HiFi_Read_Length` | Mean read length from NanoStat on the merged BAM for this well |
| `Median_HiFi_Read_Quality` | Median read quality (QV) from NanoStat on the merged BAM for this well |
| `HiFi_Yield` | Total bases from NanoStat on the merged BAM for this well |
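
The per-well read count is described as a sum over both-end and either-end assignments. As a rough sketch of that aggregation, the snippet below sums counts per well from `well,count` lines. The input layout and the helper name `sum_reads_per_well` are assumptions made for illustration; the actual lima counts files and the `DEMUX_QC` code are not reproduced here.

```shell
#!/bin/sh
# Illustrative aggregation; NOT the DEMUX_QC implementation.
# sum_reads_per_well is a hypothetical helper name.

# Reads "well,count" lines on stdin (assumed layout: one line per
# assignment type, e.g. both-end, P5-only, P7-only) and prints one
# "well,total" line per well, sorted by well.
sum_reads_per_well() {
    awk -F',' '{ total[$1] += $2 } END { for (w in total) print w "," total[w] }' | sort
}

printf '%s\n' 'A01,900' 'A01,40' 'A01,60' 'A02,500' | sum_reads_per_well
# A01,1000
# A02,500
```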
One summary table per pool covering the full run.
| Metric | Description |
|---|---|
| Unique Barcodes | Number of wells with assigned reads and non-zero yield |
| Barcoded HiFi Reads | Total reads assigned to any barcode across both lima steps |
| Unbarcoded HiFi Reads | Reads not assigned after both lima steps (from lima counts) |
| Barcoded HiFi Reads (%) | Percentage of total reads that are barcoded |
| Barcoded HiFi Yield (Gb) | Total bases across all barcoded wells |
| Unbarcoded HiFi Yield (Gb) | Total bases in the unbarcoded BAM from NANOSTAT_UNBARCODED |
| Barcoded HiFi Yield (%) | Percentage of total yield that is barcoded |
| Mean HiFi Reads per Barcode | Mean read count across all wells |
| Max HiFi Reads per Barcode | Highest read count across all wells |
| Min HiFi Reads per Barcode | Lowest read count across all wells |
| Barcoded HiFi Read Length (mean, kb) | Weighted mean read length across all barcoded wells |
| Unbarcoded HiFi Read Length (mean, kb) | Mean read length from NANOSTAT_UNBARCODED |
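
Two of the per-pool metrics are simple arithmetic and can be sketched as follows. This is a hedged illustration, not the pipeline's code: the helper names are hypothetical, the input layout is invented for the example, and the weighted mean is assumed to be weighted by per-well read count, which the table above does not state explicitly.

```shell
#!/bin/sh
# Illustrative arithmetic; NOT the DEMUX_QC implementation.
# Both helper names are hypothetical.

# Usage: barcoded_percent <barcoded_reads> <unbarcoded_reads>
# Prints 100 * barcoded / (barcoded + unbarcoded), or NA if there are no reads.
barcoded_percent() {
    awk -v b="$1" -v u="$2" 'BEGIN {
        total = b + u
        if (total == 0) { print "NA"; exit }
        printf "%.2f\n", 100 * b / total
    }'
}

# Reads "reads,mean_length" lines on stdin (one per barcoded well) and
# prints the mean read length weighted by read count (an assumption).
weighted_mean_length() {
    awk -F',' '
        { reads += $1; bases += $1 * $2 }
        END { if (reads == 0) print "NA"; else printf "%.1f\n", bases / reads }
    '
}

barcoded_percent 900 100                                       # prints 90.00
printf '%s\n' '100,10000' '300,20000' | weighted_mean_length   # prints 17500.0
```

Weighting by read count means a well with many reads contributes proportionally more to the pool-level mean than a sparsely populated well, which a plain average of per-well means would not do.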
