Ultraplex: A rapid, flexible, all-in-one fastq demultiplexer.

Oscar G Wilkins, Charlotte Capitanchik, Jernej Ule, Nicholas M Luscombe

doi:10.12688/wellcomeopenres.16791.1

Abstract

Background: The first step of virtually all next-generation sequencing analysis involves the splitting of the raw sequencing data into separate files using sample-specific barcodes, a process known as "demultiplexing". However, we found that existing software for this purpose was either too inflexible or too computationally intensive for fast, streamlined processing of raw, single-end FASTQ files containing combinatorial barcodes. Results: Here, we introduce a fast and uniquely flexible demultiplexer, named Ultraplex, which splits a raw FASTQ file containing barcodes either at a single end or at both 5' and 3' ends of reads, trims the sequencing adaptors and low-quality bases, and moves unique molecular identifiers (UMIs) into the read header, allowing subsequent removal of PCR duplicates. Ultraplex is able to perform such single or combinatorial demultiplexing on both single- and paired-end sequencing data, and can process an entire Illumina HiSeq lane, consisting of nearly 500 million reads, in less than 20 minutes. Conclusions: Ultraplex greatly reduces computational burden and pipeline complexity for the demultiplexing of complex sequencing libraries, such as those produced by various CLIP and ribosome profiling protocols, and is also very user friendly, enabling streamlined, robust data processing. Ultraplex is available on PyPI and Conda and via GitHub.

Highlights

Next-generation sequencing (NGS) has greatly reduced the cost of obtaining large amounts of sequence data, as hundreds of millions, or even billions, of reads can be generated in a single sequencing run (Goodwin et al., 2016).
We required fully multithreaded operation, to take advantage of modern CPU architectures, and all processing to be performed in a single read-write cycle, so as to avoid read/write bottlenecks. By testing it on iCLIP libraries, we demonstrated that the resulting software, Ultraplex, meets all of these requirements.
Our testing was run on a high-performance computing cluster where each CPU node is an 8-core Intel E5-2640 Haswell CPU running at 2.6 GHz, with hyperthreading enabled, running Linux 3.10.0-957.1.3.el7.x86_64. iCount was run with additional flags --min_adapter_overlap 3 -mis 1 -ml 0 and Ultraplex with -mt 3 -m5 1 -q 0 -l 17.
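
The "single read-write cycle" requirement in the highlights above can be illustrated with a minimal Python sketch in which each FASTQ record is read once, quality-trimmed, length-filtered, and written once to its sample's output. This is not Ultraplex's implementation: the function names, the `assign` callback, the default thresholds, and the naive trailing-base quality trim are all assumptions made for illustration, and the sketch is single-threaded.

```python
import io
from typing import Callable, Dict, Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input (or a blank line) stops iteration
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def trim_low_quality_tail(seq: str, qual: str, min_phred: int,
                          offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_phred.

    A deliberately naive trim, assumed here only for illustration.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_phred:
        end -= 1
    return seq[:end], qual[:end]


def single_pass(handle: TextIO,
                assign: Callable[[str], str],
                outputs: Dict[str, TextIO],
                min_phred: int = 20,
                min_length: int = 17) -> Dict[str, int]:
    """Trim, filter, and route every read in one pass over the input."""
    counts: Dict[str, int] = {}
    for header, seq, qual in read_fastq(handle):
        seq, qual = trim_low_quality_tail(seq, qual, min_phred)
        if len(seq) < min_length:
            continue  # read too short after trimming
        sample = assign(seq)  # hypothetical callback: sequence -> sample name
        out = outputs.get(sample)
        if out is None:
            continue  # no output registered for this sample
        out.write(f"{header}\n{seq}\n+\n{qual}\n")
        counts[sample] = counts.get(sample, 0) + 1
    return counts
```

Because no intermediate file is written between the trimming, filtering, and routing steps, the input is read once and each retained read is written once, which is the property the highlight describes.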

Summary

Introduction

Next-generation sequencing (NGS) has greatly reduced the cost of obtaining large amounts of sequence data, as hundreds of millions, or even billions, of reads can be generated in a single sequencing run (Goodwin et al., 2016). The first step of virtually all NGS analysis is demultiplexing, the splitting of the raw sequencing data into separate files using sample-specific barcodes. We found that existing demultiplexing software was either too inflexible or too computationally intensive for fast, streamlined processing of raw, single-end FASTQ files containing combinatorial barcodes. Here, we introduce a fast and uniquely flexible demultiplexer, named Ultraplex, which splits a raw FASTQ file containing barcodes either at a single end or at both 5' and 3' ends of reads, trims the sequencing adaptors and low-quality bases, and moves unique molecular identifiers (UMIs) into the read header, allowing subsequent removal of PCR duplicates. Ultraplex greatly reduces computational burden and pipeline complexity for the demultiplexing of complex sequencing libraries, such as those produced by various CLIP and ribosome profiling protocols, and is very user friendly, enabling streamlined, robust data processing.
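
To make the 5' demultiplexing and UMI handling described above concrete, the following minimal Python sketch matches a read against barcode patterns and moves the UMI bases into the read header. It rests on several assumptions that are not taken from the paper: barcode patterns use `N` for UMI positions and A/C/G/T for fixed sample bases, the first pattern within the mismatch limit wins (rather than the best match), and the `rbc:` header tag is an illustrative choice. The function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def match_five_prime(seq: str, pattern: str,
                     max_mismatches: int = 1) -> Optional[Tuple[str, int]]:
    """Return (umi, barcode_length) if seq starts with pattern, else None.

    N positions in the pattern are collected as UMI bases; every other
    position must match the read, allowing up to max_mismatches errors.
    """
    if len(seq) < len(pattern):
        return None
    mismatches = 0
    umi = []
    for base, expected in zip(seq, pattern):
        if expected == "N":
            umi.append(base)
        elif base != expected:
            mismatches += 1
            if mismatches > max_mismatches:
                return None
    return "".join(umi), len(pattern)


def demultiplex_read(header: str, seq: str, qual: str,
                     barcodes: Dict[str, str],
                     max_mismatches: int = 1
                     ) -> Optional[Tuple[str, str, str, str]]:
    """Assign one read to a sample.

    Returns (sample, new_header, trimmed_seq, trimmed_qual), or None when
    no barcode pattern matches. The barcode and UMI bases are removed from
    the read and the UMI is appended to the read name.
    """
    for sample, pattern in barcodes.items():
        hit = match_five_prime(seq, pattern, max_mismatches)
        if hit is None:
            continue
        umi, n = hit
        name, _, rest = header.partition(" ")
        # Assumed tag format: UMI appended to the read name as "rbc:<UMI>"
        new_header = f"{name}rbc:{umi}" + (f" {rest}" if rest else "")
        return sample, new_header, seq[n:], qual[n:]
    return None
```

With the UMI stored in the header, a downstream deduplication step can treat reads that share a mapping position and UMI as PCR duplicates without needing the original barcode bases.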

Methods

Results

Conclusion