smsMap: mapping single molecule sequencing reads by locating the alignment starting positions

Ze-Gang Wei, Fei Liu, Shao-Wu Zhang

doi:10.1186/s12859-020-03698-w

Open Access

Abstract

Background: Single Molecule Sequencing (SMS) technology produces longer reads, but with a higher sequencing error rate. Mapping these reads to a reference genome is often the most fundamental and computationally intensive step for downstream analysis. Most existing mapping tools adopt the traditional seed-and-extend strategy, and the candidate aligned regions for each query read are selected either by counting the number of matched seeds or by chaining a group of seeds. However, with the existing mapping tools, the coverage ratio of the alignment region to the query read is relatively low, and read alignment quality and efficiency need to be improved. Here, we introduce smsMap, a novel mapping tool specifically designed to map the long reads of SMS to a reference genome.

Results: smsMap was evaluated against seven existing SMS mapping tools (e.g., BLASR, minimap2, and BWA-MEM) on both simulated and real-life SMS datasets. The experimental results show that smsMap efficiently achieves a higher aligned read coverage ratio and has higher sensitivity, aligning more sequences and bases to the reference genome. Additionally, smsMap is more robust to sequencing errors.

Conclusions: smsMap is computationally efficient at aligning SMS reads, especially for large reference genomes (e.g., the H. sapiens genome, with over 3 billion base pairs). The source code of smsMap can be freely downloaded from https://github.com/NWPU-903PR/smsMap.

Highlights

Single Molecule Sequencing (SMS) technology produces longer reads, but with a higher sequencing error rate.
Most mapping methods for SMS reads adopt the classical seed-and-extension methodology to obtain the alignment results. They first find exactly matched seeds in the reference genome, then select the candidate aligned region either by counting the number of matched seeds or by chaining a group of seeds that are co-linear or close to each other (e.g., BLASR, LAMSA, GraphMap, NGMLR).
With the development of SMS technologies (e.g., PacBio and Oxford Nanopore MinION) that produce long but noisy reads, mapping these reads to the reference genome has become a central bioinformatics challenge.

[Table 5: Running time and memory usage (GB) of each mapping method on three datasets; columns: smsMap, BWA-MEM, BLASR, lordFAST, minimap2, GraphMap*, NGMLR]
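
The seed-counting variant of candidate selection described in the highlights can be illustrated with a minimal Python sketch. This is not the code of smsMap or of any tool named above. The function names, the fixed k-mer seeds, the hash-table index, and the `bin_width` and `min_seeds` parameters are all assumptions made for illustration. Each exact seed match votes for the diagonal (reference position minus read position) it lies on, and diagonals are grouped into bins so that small indels do not scatter the votes.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_regions_by_seed_count(read, index, k, bin_width=50, min_seeds=2):
    """Rank candidate regions by the number of exact seed matches they collect.

    Returns (approximate_reference_start, seed_count) pairs, best first.
    bin_width and min_seeds are illustrative parameters, not values from the paper.
    """
    votes = defaultdict(int)
    for read_pos in range(len(read) - k + 1):
        seed = read[read_pos:read_pos + k]
        for ref_pos in index.get(seed, ()):
            # Seeds from the same true alignment share (nearly) the same diagonal.
            diagonal = ref_pos - read_pos
            votes[diagonal // bin_width] += 1
    ranked = sorted(votes.items(), key=lambda item: item[1], reverse=True)
    return [(bin_id * bin_width, count) for bin_id, count in ranked if count >= min_seeds]
```

A caller would build the index once per reference, call `candidate_regions_by_seed_count` for each read, and pass the top-ranked regions to the extension step.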

Summary

Introduction

Single Molecule Sequencing (SMS) technology produces longer reads, but with a higher sequencing error rate. A number of tools for mapping SMS long reads to a reference genome have been proposed, such as BLASR [20], BWA-MEM [21], rHAT [22], GraphMap [23], LAMSA [24], minimap2 [25], NGMLR [26], and lordFAST [27]. BLASR [20] was the first tool specifically designed for mapping SMS reads. It first builds a BWT-FM index [15, 16] of the genome to search for exact matches, then applies sparse dynamic programming (SDP) to generate rough alignments. lordFAST [27] first builds an index from the reference genome and maps reads to the reference genome by extracting the longest exact matches. It then selects candidate alignment regions and obtains the base-to-base alignment with dynamic programming.
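
The idea of chaining exact matches into a rough alignment can be sketched in Python as follows. This is a simplified quadratic-time dynamic program, not BLASR's sparse dynamic programming and not the algorithm of any tool cited above. The function name `chain_seeds`, the scoring rule (seed length minus diagonal drift), and the `max_gap` parameter are assumptions chosen for illustration.

```python
def chain_seeds(anchors, k, max_gap=500):
    """Return the best-scoring co-linear chain of exact seed matches.

    anchors: list of (read_pos, ref_pos) pairs, each an exact match of length k.
    Each anchor adds k to the chain score; joining two anchors costs the
    difference between their read gap and reference gap (diagonal drift).
    """
    if not anchors:
        return []
    anchors = sorted(anchors)
    n = len(anchors)
    score = [k] * n   # best chain score ending at each anchor
    prev = [-1] * n   # back-pointer to the previous anchor in that chain
    for j in range(n):
        read_j, ref_j = anchors[j]
        for i in range(j):
            read_i, ref_i = anchors[i]
            read_gap = read_j - read_i
            ref_gap = ref_j - ref_i
            # Co-linear: anchor j follows anchor i in both sequences without overlap.
            if read_gap < k or ref_gap < k or max(read_gap, ref_gap) > max_gap:
                continue
            candidate = score[i] + k - abs(read_gap - ref_gap)
            if candidate > score[j]:
                score[j] = candidate
                prev[j] = i
    # Trace back from the highest-scoring anchor.
    best = max(range(n), key=score.__getitem__)
    chain = []
    while best != -1:
        chain.append(anchors[best])
        best = prev[best]
    return chain[::-1]
```

The returned chain brackets a candidate alignment region. A production mapper would replace the inner loop with a sparse data structure to avoid the quadratic cost, then run base-level dynamic programming between consecutive anchors.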

Methods

Results

Discussion

Conclusion