Abstract
Seeding heuristics are the most widely used strategies to speed up sequence alignment in bioinformatics. Such strategies are most successful if they are calibrated, so that the speed-versus-accuracy trade-off can be properly tuned. In the common case of read mapping, it has so far been impossible to predict the success rate of competing seeding strategies for lack of a theoretical framework. Here, we present an approach to estimate such quantities based on the theory of analytic combinatorics. The strategy is to specify a combinatorial construction of reads where the seeding heuristic fails, translate this specification into a generating function using formal rules, and finally extract the probabilities of interest from the singularities of the generating function. The generating function can also be used to set up a simple recurrence to compute the probabilities with greater precision. We use this approach to construct simple estimators of the success rate of the seeding heuristic under different types of sequencing errors, and we show that the estimates are accurate in practical situations. More generally, this work shows novel strategies based on analytic combinatorics to compute probabilities of interest in bioinformatics.
Highlights
Bioinformatics is going through a transition driven by the ongoing developments of high throughput sequencing [1,2]
We present the concepts of analytic combinatorics that are necessary to expose the main result regarding seeding probabilities in the read mapping problem
Define “atoms” with simple weighted generating functions, combine these atoms into objects of increasing complexity, construct their weighted generating functions from simple rules, and analyze the singularities of the weighted generating functions to approximate the quantities of interest
Summary
Bioinformatics is going through a transition driven by the ongoing developments of high throughput sequencing [1,2]. To cope with the surge of sequencing data, the bioinformatics community is under pressure to produce faster and more efficient algorithms. A common strategy to scale up analyses to large data sets is to use heuristics that are faster, but are not guaranteed to return the optimal result. We present the concepts of analytic combinatorics that are necessary to state the main result regarding seeding probabilities in the read mapping problem. The weighted generating function can be used to extract an exact recurrence equation that computes the probabilities of interest with higher accuracy. The central object of analytic combinatorics is the generating function [13].
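As a rough illustration of how a recurrence and a singularity estimate can both come out of one weighted generating function, the sketch below works through a deliberately simplified model that is an assumption of this example and is not spelled out in the text above: each nucleotide is an independent substitution error with probability p, and seeding fails when the read contains no error-free stretch of at least gamma nucleotides. Under that assumption, a failing read is a sequence of blocks (fewer than gamma correct nucleotides followed by one error) plus a final short error-free stretch, which gives W(z) = F(z) / (1 - p z F(z)) with F(z) = 1 + qz + ... + (qz)^(gamma-1) and q = 1 - p. The function names and parameters are hypothetical.

```python
def failure_prob_recurrence(k, gamma, p):
    """Exact probability that a read of k nucleotides has no error-free
    stretch of length >= gamma (assumed model: independent errors, rate p).

    Multiplying W(z) by (1 - qz) gives the denominator
    1 - z + p q^gamma z^(gamma+1), hence the recurrence
    a[n] = a[n-1] - p q^gamma a[n-gamma-1] for n > gamma.
    """
    q = 1.0 - p
    a = [1.0] * (k + 1)  # reads shorter than gamma cannot contain a seed
    for n in range(gamma, k + 1):
        if n == gamma:
            a[n] = 1.0 - q ** gamma  # fails unless all gamma nucleotides are correct
        else:
            a[n] = a[n - 1] - p * q ** gamma * a[n - gamma - 1]
    return a[k]


def failure_prob_asymptotic(k, gamma, p):
    """Approximation of the same probability from the dominant singularity.

    z1 is the positive root of G(z) = 1 - p z F(z); the estimate is
    C / z1^(k+1) with C = F(z1) / (-G'(z1)).
    """
    q = 1.0 - p

    def F(z):
        return sum((q * z) ** i for i in range(gamma))

    def G(z):
        return 1.0 - p * z * F(z)

    # G decreases on z > 0, with G(1) = q^gamma > 0 and G(1/p) <= 0,
    # so plain bisection on [1, 1/p] is enough to locate z1.
    lo, hi = 1.0, 1.0 / p
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if G(mid) > 0:
            lo = mid
        else:
            hi = mid
    z1 = 0.5 * (lo + hi)

    minus_G_prime = p * sum((i + 1) * (q * z1) ** i for i in range(gamma))
    C = F(z1) / minus_G_prime
    return C / z1 ** (k + 1)


if __name__ == "__main__":
    k, gamma, p = 50, 5, 0.1
    print("recurrence :", failure_prob_recurrence(k, gamma, p))
    print("asymptotic :", failure_prob_asymptotic(k, gamma, p))
```

In this toy setting the two functions should agree closely once k is well above gamma, while only the recurrence is exact for short reads. Handling other error types, such as insertions and deletions, would require a different combinatorial construction than the one assumed here.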