Abstract
Finding alignments between millions of reads and genome sequences is crucial in computational biology. Since the standard alignment algorithm has a large computational cost, heuristics have been developed to speed up this task. Though orders of magnitude faster, these methods lack theoretical guarantees and often have low sensitivity, especially when reads have many insertions, deletions, and mismatches relative to the genome. Here we develop a theoretically principled and efficient algorithm that has high sensitivity across a wide range of insertion, deletion, and mutation rates. We frame sequence alignment as an inference problem in a probabilistic model. Given a reference database of reads and a query read, we find the match that maximizes the log-likelihood ratio of the reference read and the query read being generated jointly from a probabilistic model versus being generated from independent models. The brute-force solution to this problem computes joint and independent probabilities for each query and reference pair, so its complexity grows linearly with database size. We introduce a bucketing strategy in which reads with a higher log-likelihood ratio are mapped to the same bucket with high probability. Experimental results show that our method is more accurate than state-of-the-art approaches in aligning long reads from Pacific Biosciences sequencers to genome sequences.
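The brute-force baseline described above can be illustrated with a minimal sketch. It assumes a toy substitution-only model over equal-length sequences: each base is uniform over A, C, G, T, and under the joint model the reference base equals the query base with probability `match_prob`. This is not the probabilistic model of the paper, which also accounts for insertions and deletions. The names `joint_prob`, `log_likelihood_ratio`, `brute_force_best_match`, and the parameter `match_prob` are hypothetical and chosen only for illustration.

```python
import math

ALPHABET = "ACGT"


def joint_prob(a, b, match_prob=0.85):
    """P(a, b) under a toy joint model (an assumption, not the paper's model):
    a is uniform over the alphabet, and b equals a with probability match_prob,
    otherwise b is one of the other bases with equal probability."""
    marginal = 1.0 / len(ALPHABET)
    if a == b:
        return marginal * match_prob
    return marginal * (1.0 - match_prob) / (len(ALPHABET) - 1)


def log_likelihood_ratio(query, reference, match_prob=0.85):
    """Sum over positions of log[P_joint(q, r) / (P(q) * P(r))].

    Substitution-only: both sequences must have the same length.
    """
    if len(query) != len(reference):
        raise ValueError("this toy model requires equal-length sequences")
    independent = (1.0 / len(ALPHABET)) ** 2
    return sum(
        math.log(joint_prob(q, r, match_prob) / independent)
        for q, r in zip(query, reference)
    )


def brute_force_best_match(query, references, match_prob=0.85):
    """Score the query against every reference and keep the maximum.

    One score per reference, so the cost grows linearly with database size.
    Returns (index of best reference, its log-likelihood ratio).
    """
    best_index, best_score = -1, -math.inf
    for index, reference in enumerate(references):
        score = log_likelihood_ratio(query, reference, match_prob)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score
```

Calling `brute_force_best_match("ACGTACGA", ["ACGTACGT", "TTTTCCCC", "ACGTTCGT"])` scores all three references and selects the first, which differs from the query at a single position. The loop over every reference is the linear cost that a bucketing strategy is meant to avoid; the sketch does not attempt to reproduce that strategy.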