Abstract
Motivation: The explosive growth of next-generation sequencing datasets poses a challenge to the mapping of reads to reference genomes in terms of alignment quality and execution speed. With the continuing progress of high-throughput sequencing technologies, read length is constantly increasing and many existing aligners are becoming inefficient as generated reads grow larger.
Results: We present CUSHAW2, a parallelized, accurate, and memory-efficient long read aligner. Our aligner is based on the seed-and-extend approach and uses maximal exact matches as seeds to find gapped alignments. We have evaluated and compared CUSHAW2 to the three other long read aligners BWA-SW, Bowtie2 and GASSST, by aligning simulated and real datasets to the human genome. The performance evaluation shows that CUSHAW2 is consistently among the highest-ranked aligners in terms of alignment quality for both single-end and paired-end alignment, while demonstrating highly competitive speed. Furthermore, our aligner shows good parallel scalability with respect to the number of CPU threads.
Availability: CUSHAW2, written in C++, and all simulated datasets are available at http://cushaw2.sourceforge.net
Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de
Supplementary information: Supplementary data are available at Bioinformatics online.
Highlights
Many biological applications of next-generation sequencing (NGS) require the alignment of large quantities of produced reads to a given reference genome.
We have presented CUSHAW2, a parallel and accurate algorithm and tool for aligning long reads to large genomes, such as the human genome.
Maximal exact matches (MEMs) are used as seeds to find gapped alignments, and final alignments are reported in SAM format (Li et al., 2009) to facilitate downstream analysis.
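To make the notion of a maximal exact match concrete, the following is a minimal Python sketch that enumerates MEMs between a read and a reference by scanning every diagonal of the comparison matrix. It is an illustration only and not CUSHAW2's implementation: the function name `find_mems` and the `min_len` parameter are hypothetical, and the quadratic scan stands in for the index-based search a production aligner would use on a genome-sized reference.

```python
def find_mems(read, ref, min_len=3):
    """Return sorted (read_pos, ref_pos, length) triples for every exact
    match of at least min_len bases that cannot be extended left or right.

    Hypothetical helper for illustration; brute force, O(len(read) * len(ref)).
    """
    mems = []
    n, m = len(read), len(ref)
    # Each diagonal fixes the offset ref_pos - read_pos.
    for diag in range(-(n - 1), m):
        i = max(0, -diag)   # start position in the read
        j = i + diag        # matching start position in the reference
        run = 0             # length of the current run of matching bases
        while i < n and j < m:
            if read[i] == ref[j]:
                run += 1
            else:
                # A mismatch ends the run, so the run is right-maximal;
                # it is left-maximal because it began after a mismatch
                # or at the start of the diagonal.
                if run >= min_len:
                    mems.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        # A run that reaches the end of either sequence is also maximal.
        if run >= min_len:
            mems.append((i - run, j - run, run))
    return sorted(mems)


if __name__ == "__main__":
    for read_pos, ref_pos, length in find_mems("GATTACA", "CCGATTAGGTACA"):
        print(read_pos, ref_pos, length)
```

In a seed-and-extend scheme, each reported triple would serve as an anchor around which a gapped alignment of the full read is then computed.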
Summary
Many biological applications of next-generation sequencing (NGS) require the alignment of large quantities of produced reads to a given reference genome. A wide variety of short read aligners have been developed in recent years. They can be classified into two categories according to how they identify seeds: hash tables and prefix/suffix tries. Many existing short read aligners are becoming inefficient as generated reads grow to a few hundred bp in length. One reason is that they typically perform only ungapped alignments, or gapped alignments allowing a very limited number of gaps (typically one gap). The characteristics of long read alignment therefore motivate the design of new aligners that combine fast speed with high alignment quality.
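As an illustration of the hash-table category of seeding mentioned above, the sketch below indexes every k-mer of a reference in a dictionary and looks up the k-mers of a read to obtain candidate seed positions. It is a simplified example under stated assumptions: the names `build_kmer_index` and `seed_hits` and the choice of fixed-length k-mers are hypothetical and do not describe any particular aligner.

```python
from collections import defaultdict


def build_kmer_index(ref, k):
    """Map each k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        index[ref[pos:pos + k]].append(pos)
    return index


def seed_hits(read, index, k):
    """Return (read_pos, ref_pos) pairs for every read k-mer found in the index."""
    hits = []
    for i in range(len(read) - k + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for pos in index.get(read[i:i + k], []):
            hits.append((i, pos))
    return hits


if __name__ == "__main__":
    idx = build_kmer_index("ACGTACGT", 4)
    print(seed_hits("TACGT", idx, 4))
```

A prefix/suffix trie based aligner would answer the same kind of query through an index over the reference's suffixes rather than a table of fixed-length k-mers, which is what allows variable-length seeds.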