Abstract

Sequence alignment is a critical step in many genomic studies, such as variant calling, quantitative transcriptome analysis (RNA-seq), and metagenomic sequence classification. However, alignment performance is strongly affected by repetitive sequences in the reference genome, which are widespread in species from bacteria to mammals. Aligning repetitive sequences can produce a very large number of candidate locations, which imposes a heavy computational burden. Most alignment tools therefore simply discard highly repetitive seeds, but this may cause the true alignment to be missed. Using maximal exact matches (MEMs) as seeds is an option, but MEM seeds may fail when sequencing errors or genomic variations fall within them. Here, we propose a novel sequence alignment algorithm, named MAM, which can efficiently align short DNA sequences. MAM first builds a modified Burrows-Wheeler transform (BWT) structure of the reference genome to accelerate approximate seed matching. Then, MAM uses maximal approximate match (MAM) seeds to reduce the number of candidate locations. Finally, MAM applies affine-gap-penalty dynamic programming to extend the MAM seeds. Experimental results on simulated and real sequencing datasets show that MAM is faster than other state-of-the-art alignment tools. The source code is available at https://github.com/weiquan/mam.
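To illustrate the final extension step, the following is a minimal sketch of affine-gap-penalty dynamic programming (the classic three-matrix Gotoh recurrence) that computes a global alignment score between two short sequences. It is not taken from the MAM source code; the function name `affine_gap_score` and the scoring parameters (`match`, `mismatch`, `gap_open`, `gap_extend`) are assumptions chosen only for illustration, and a real aligner would typically use a banded or local variant around each seed.

```python
NEG_INF = float("-inf")


def affine_gap_score(a, b, match=1, mismatch=-4, gap_open=-6, gap_extend=-1):
    """Global alignment score with affine gaps (Gotoh recurrence).

    A gap of length k costs gap_open + k * gap_extend.
    All scoring values are illustrative assumptions, not MAM's defaults.
    """
    n, m = len(a), len(b)
    # M: alignment ends with a[i-1] aligned to b[j-1]
    # X: alignment ends with a[i-1] aligned to a gap
    # Y: alignment ends with b[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a new gap pays gap_open + gap_extend;
            # continuing an existing gap pays only gap_extend.
            X[i][j] = max(
                M[i - 1][j] + gap_open + gap_extend,
                Y[i - 1][j] + gap_open + gap_extend,
                X[i - 1][j] + gap_extend,
            )
            Y[i][j] = max(
                M[i][j - 1] + gap_open + gap_extend,
                X[i][j - 1] + gap_open + gap_extend,
                Y[i][j - 1] + gap_extend,
            )

    return max(M[n][m], X[n][m], Y[n][m])
```

With these assumed parameters, a single gap of length two is penalized less than two separate gaps of length one, which is the property that makes affine penalties a better fit for indels than a linear gap cost.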

Highlights

  • The development of next-generation sequencing (NGS) technologies has led to a rapid decline in the sequencing cost and had a tremendous impact on genomic research (Morozova and Marra, 2008; Reinert et al, 2015)

  • MAM is distributed under the GNU General Public License (GPL)

  • All aligners were tested on two simulated datasets and two high-throughput sequencing (HTS) datasets to assess their speed, sensitivity, and accuracy


Summary

Introduction

The development of next-generation sequencing (NGS) technologies has led to a rapid decline in sequencing cost and has had a tremendous impact on genomic research (Morozova and Marra, 2008; Reinert et al, 2015). There has been an intense effort in recent years to develop computational methods and applications to meet the increasing demand for sequencing data analysis (Flicek and Birney, 2009). One of the fundamental tasks is sequence alignment. Many alignment methods have been proposed to improve the efficiency and accuracy of sequence alignment, including but not limited to Maq (Li et al, 2008a), SOAP (Li et al, 2008b), Bowtie (Langmead et al, 2009), BWA (Li and Durbin, 2009), and mrsFAST (Hach et al, 2010). Aligning repetitive DNA sequences accurately to the reference genome remains a major issue.
