A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases.

Chirag Jain, Sergey Koren, Alexander Dilthey, Srinivas Aluru, Adam M Phillippy

doi:10.1089/cmb.2018.0036

Abstract

Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this article, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290× faster than Burrows-Wheeler Aligner-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.
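
The two ingredients named in the abstract, minimizer sampling and a Jaccard-based identity estimate, can be illustrated with a small Python sketch. This is not the authors' implementation. It assumes lexicographic k-mer order in place of a hash function, an exact Jaccard index in place of a MinHash sketch, and the Mash distance relation D = -(1/k) ln(2J / (1 + J)) as the link between Jaccard similarity and identity. All function names and parameter values are hypothetical.

```python
import math


def kmers(seq, k):
    """All overlapping k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Pick the smallest k-mer in every window of w consecutive k-mers.

    Assumption: lexicographic order stands in for a hash function, and the
    leftmost k-mer wins ties. Returns a set of (position, k-mer) pairs.
    """
    km = kmers(seq, k)
    picked = set()
    for start in range(len(km) - w + 1):
        window = km[start:start + w]
        j = min(range(w), key=lambda i: window[i])
        picked.add((start + j, window[j]))
    return picked


def jaccard(a, b):
    """Exact Jaccard index of two sets (a MinHash sketch would estimate this)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def identity_from_jaccard(j, k):
    """Convert a Jaccard index into an identity estimate via the Mash distance.

    Assumption: D = -(1/k) * ln(2j / (1 + j)), identity = 1 - D, clamped at 0.
    """
    if j <= 0:
        return 0.0
    mash_distance = -math.log(2 * j / (1 + j)) / k
    return max(0.0, 1.0 - mash_distance)


if __name__ == "__main__":
    read = "ACGTACGTTGCAACGT"
    ref = "ACGTACGATGCAACGT"  # one substitution relative to the read
    k, w = 4, 3
    read_set = {kmer for _, kmer in minimizers(read, k, w)}
    ref_set = {kmer for _, kmer in minimizers(ref, k, w)}
    j = jaccard(read_set, ref_set)
    print(f"Jaccard: {j:.3f}, identity estimate: {identity_from_jaccard(j, k):.3f}")
```

Comparing only minimizer k-mers rather than all k-mers is what keeps the index small, and the identity estimate lets candidate mappings be filtered by a minimum identity threshold without computing an alignment.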

Journal: Journal of Computational Biology
Publication Date: Apr 30, 2018
Citations: 62