Long read alignment based on maximal exact match seeds

Yongchao Liu, Bertil Schmidt

doi:10.1093/bioinformatics/bts414

Journal: Bioinformatics
Publication Date: Sep 3, 2012
Citations: 94
License type: CC BY 3.0

Affiliation: Johannes Gutenberg University Mainz

Abstract

Motivation: The explosive growth of next-generation sequencing datasets poses a challenge to the mapping of reads to reference genomes in terms of alignment quality and execution speed. With the continuing progress of high-throughput sequencing technologies, read length is constantly increasing and many existing aligners are becoming inefficient as generated reads grow larger.

Results: We present CUSHAW2, a parallelized, accurate, and memory-efficient long read aligner. Our aligner is based on the seed-and-extend approach and uses maximal exact matches as seeds to find gapped alignments. We have evaluated and compared CUSHAW2 to the three other long read aligners BWA-SW, Bowtie2 and GASSST, by aligning simulated and real datasets to the human genome. The performance evaluation shows that CUSHAW2 is consistently among the highest-ranked aligners in terms of alignment quality for both single-end and paired-end alignment, while demonstrating highly competitive speed. Furthermore, our aligner shows good parallel scalability with respect to the number of CPU threads.

Availability: CUSHAW2, written in C++, and all simulated datasets are available at http://cushaw2.sourceforge.net

Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

Many biological applications of next-generation sequencing (NGS) require the alignment of large quantities of produced reads to a given reference genome.
We have presented CUSHAW2, a parallel and accurate algorithm and tool for aligning long reads to large genomes, such as the human genome.
Maximal exact matches (MEMs) are used as seeds to find gapped alignments, and final alignments are reported in SAM format (Li et al., 2009) to facilitate downstream analysis.
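
To make the notion of a maximal exact match concrete, the following is a minimal sketch in Python. It is a brute-force quadratic scan written only for illustration, not the method used by CUSHAW2 (this summary does not describe how the tool computes its seeds, and practical aligners rely on an index of the reference). The function name, the `min_len` parameter and the toy sequences are assumptions introduced here.

```python
def maximal_exact_matches(read, ref, min_len=4):
    """Return (read_start, ref_start, length) for every exact match
    between read and ref that cannot be extended to the left or right
    and is at least min_len bases long."""
    mems = []
    for i in range(len(read)):
        for j in range(len(ref)):
            if read[i] != ref[j]:
                continue
            # If the preceding bases also match, this position lies inside
            # a longer match that starts earlier, so it is not left-maximal.
            if i > 0 and j > 0 and read[i - 1] == ref[j - 1]:
                continue
            # Extend to the right for as long as the bases agree.
            length = 0
            while (i + length < len(read)
                   and j + length < len(ref)
                   and read[i + length] == ref[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


# Toy example (hypothetical sequences, not taken from the paper).
toy_read = "ACGTTGCA"
toy_ref = "TTACGTAGTTGCAT"
toy_mems = maximal_exact_matches(toy_read, toy_ref, min_len=4)
```

For these toy inputs the result is `[(0, 2, 4), (2, 7, 6)]`: the read prefix `ACGT` matches the reference at offset 2, and the read suffix `GTTGCA` matches at offset 7. In a seed-and-extend aligner, such seeds would then anchor a gapped alignment of the whole read.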

Summary

Introduction

Many biological applications of next-generation sequencing (NGS) require the alignment of large quantities of produced reads to a given reference genome. A wide variety of short read aligners have been developed in recent years. They can be classified into two categories according to the data structure they use to identify seeds: hash tables and prefix/suffix tries. Many existing short read aligners become inefficient as generated reads grow to a few hundred bp in length, for two reasons, one of which is that they typically perform only ungapped alignments, or gapped alignments allowing a very limited number of gaps (typically one gap). These characteristics of long read alignment motivate the design of new aligners that combine high speed with high alignment quality.
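
As an illustration of the first category, the sketch below shows how a hash table can be used to look up fixed-length seeds. It is a simplified example under stated assumptions, not the implementation of any aligner named in this summary: the function names, the seed length `k` and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def build_kmer_index(ref, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        index[ref[pos:pos + k]].append(pos)
    return index


def find_seed_hits(read, index, k):
    """Return (read_pos, ref_pos) pairs where a k-mer of the read
    occurs exactly in the indexed reference."""
    hits = []
    for i in range(len(read) - k + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for pos in index.get(read[i:i + k], []):
            hits.append((i, pos))
    return hits


# Toy example (hypothetical sequences, not taken from the paper).
example_index = build_kmer_index("TTACGTAGTTGCAT", k=4)
example_hits = find_seed_hits("ACGTTGCA", example_index, k=4)
```

Here `example_hits` is `[(0, 2), (2, 7), (3, 8), (4, 9)]`. The last three hits lie on the same diagonal (reference position minus read position equals 5), so a fixed-length scheme reports one longer match as several overlapping seeds, and the number of such hits grows with read length.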

Methods

Results

Conclusion