Large scale sequence alignment via efficient inference in generative models

Mihir Mongia, Chengze Shen, Hosein Mohimani, Arash Gholami Davoodi, Guillaume Marçais

doi:10.1038/s41598-023-34257-x

Abstract

Finding alignments between millions of reads and genome sequences is crucial in computational biology. Since the standard alignment algorithm has a large computational cost, heuristics have been developed to speed up this task. Though orders of magnitude faster, these methods lack theoretical guarantees and often have low sensitivity, especially when reads have many insertions, deletions, and mismatches relative to the genome. Here we develop a theoretically principled and efficient algorithm that has high sensitivity across a wide range of insertion, deletion, and mutation rates. We frame sequence alignment as an inference problem in a probabilistic model. Given a reference database of reads and a query read, we find the match that maximizes the log-likelihood ratio of the reference read and the query read being generated jointly from a probabilistic model versus being generated from independent models. The brute-force solution to this problem computes joint and independent probabilities for each query and reference pair, so its complexity grows linearly with database size. We introduce a bucketing strategy in which reads with a higher log-likelihood ratio are mapped to the same bucket with high probability. Experimental results show that our method is more accurate than state-of-the-art approaches in aligning long reads from Pacific Biosciences sequencers to genome sequences.
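The brute-force baseline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `error_rate` parameter, and the toy generative model are assumptions made for illustration. The toy model covers substitutions only (no insertions or deletions), assumes equal-length sequences, and assumes uniform base frequencies, so a matching position contributes log(4(1 - e)) and a mismatching position contributes log(4e/3) to the log-likelihood ratio.

```python
import math


def log_likelihood_ratio(query, ref, error_rate=0.1):
    """Log-likelihood ratio of (query, ref) under a joint vs. independent model.

    Assumed toy model (not from the paper): uniform bases, substitutions only.
    Joint:       P(x, y) = 1/4 * (1 - e) if x == y, else 1/4 * e / 3
    Independent: P(x) * P(y) = 1/16
    """
    if len(query) != len(ref):
        raise ValueError("this toy model requires equal-length sequences")
    match_term = math.log(4 * (1 - error_rate))
    mismatch_term = math.log(4 * error_rate / 3)
    return sum(match_term if q == r else mismatch_term
               for q, r in zip(query, ref))


def best_match_brute_force(query, references, error_rate=0.1):
    """Score the query against every reference; cost is linear in len(references)."""
    scores = [log_likelihood_ratio(query, r, error_rate) for r in references]
    best = max(range(len(references)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical toy data: the first reference differs from the query at one position.
references = ["ACGTACGT", "TTTTACGA", "ACGTTCGT"]
best_index, best_score = best_match_brute_force("ACGTACGA", references)
print(best_index, round(best_score, 3))
```

Every query is compared with every reference here, which is the linear scan the abstract describes. The bucketing strategy proposed in the paper is meant to avoid that scan, and is not reproduced in this sketch.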

Journal: Scientific Reports
Publication Date: May 4, 2023
License type: open-access

