Abstract

The increasing throughput of DNA sequencing technologies creates a need for faster algorithms. The fate of most reads is to be mapped to a reference sequence, typically a genome. Modern mappers rely on heuristics to gain speed at a reasonable cost in accuracy. In the seeding heuristic, short matches between the reads and the genome are used to narrow the search to a set of candidate locations. Several seeding variants used in modern mappers show good empirical performance, but they are difficult to calibrate or to optimize for lack of theoretical results. Here we develop a theory to estimate the probability that the correct location of a read is filtered out during seeding, resulting in mapping errors. We describe the properties of simple exact seeds, skip seeds and MEM seeds (Maximal Exact Match seeds). The main innovation of this work is to use concepts from analytic combinatorics to represent reads as abstract sequences, and to specify their generating functions to estimate the probabilities of interest. We provide several algorithms, which together give a workable solution for the problem of calibrating seeding heuristics for short reads. We also provide a C implementation of these algorithms in a library called Sesame. These results can improve current mapping algorithms and lay the foundation of a general strategy to tackle sequence alignment problems. The Sesame library is open source and available for download at https://github.com/gui11aume/sesame.

Highlights

  • Maximal Exact Match (MEM) seeds yield fewer candidates and speed up the mapping process, but the question is at what cost? The theory developed here will allow us to compute the probability that a read has no MEM seed for the target and that the true location is missed at the seeding stage

  • The first is the probability that the read has no seed for the target, which we have already computed in section 3.4 using recursive expression (3)

  • We have devised a set of methods to compute the probability that seeding heuristics fail and commit the mapping process to an error
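To make the notion of "no seed for the target" concrete, here is a minimal sketch of how such a probability can be computed for exact seeds. It is a textbook dynamic program over the length of the current error-free stretch, not the recursive expression (3) of the paper and not the Sesame API. It assumes independent sequencing errors with a uniform rate; the function name `prob_no_exact_seed` and its parameters (`k` for read length, `gamma` for minimum seed length, `p` for error rate) are chosen here for illustration only.

```python
def prob_no_exact_seed(k, gamma, p):
    """Probability that a read of k nucleotides contains no error-free
    stretch of at least gamma nucleotides (illustrative assumption:
    each nucleotide is an error independently with probability p)."""
    q = 1.0 - p
    # state[j] = probability that no seed has appeared so far and the read
    # currently ends with exactly j consecutive error-free nucleotides.
    state = [0.0] * gamma
    state[0] = 1.0
    for _ in range(k):
        new = [0.0] * gamma
        # An error resets the error-free stretch to length 0.
        new[0] = p * sum(state)
        # A correct nucleotide extends the stretch by one; the mass that
        # would reach length gamma is dropped because a seed then exists.
        for j in range(gamma - 1):
            new[j + 1] = q * state[j]
        state = new
    return sum(state)


if __name__ == "__main__":
    # Example call: reads of 50 nucleotides, seeds of 19, 1% error rate.
    print(prob_no_exact_seed(50, 19, 0.01))
```

The cost is proportional to `k * gamma`, so the value is cheap to compute for short reads. Under these simplifying assumptions it only covers the on-target case, where the target carries no seed; seeds for off-target locations require the additional machinery listed in the outline below.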

Summary

Mapping Reads to Genomes

Say we use an imperfect instrument to sequence a small fragment of DNA. If we know its genome of origin, how can we find the location of the fragment in this genome? We will refer to this as the true location problem. If the sequenced DNA fragment originates from a repetitive sequence, we can map the read only if we can rule out all the candidates except one. The difficulty of the true location problem is due to sequencing errors. The true location of a DNA fragment is not always the best (as measured by the identity between the sequence and the candidate location). This naturally leads to asking how we can identify the best location of the fragment in the genome. We will refer to this as the best location problem. It amounts to finding the optimal alignment between two sequences, and for this reason has received substantial attention in bioinformatics.
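The difference between the true location and the best location can be illustrated with a small hand-built example. The sketch below is not taken from the paper: the toy genome, the two near-identical copies and the helper names `hamming` and `best_location` are all hypothetical, and identity is measured by a plain Hamming distance without gaps. The read originates from the first copy, but a single sequencing error makes it identical to the second copy.

```python
def hamming(a, b):
    """Number of mismatches between two sequences of equal length."""
    return sum(x != y for x, y in zip(a, b))


def best_location(read, genome):
    """Start position with the fewest mismatches (lowest index on ties)."""
    k = len(read)
    scores = [(hamming(read, genome[i:i + k]), i)
              for i in range(len(genome) - k + 1)]
    return min(scores)[1]


# Two duplicates that differ at a single position (index 7: G versus T).
copy_a = "GATTACAGGC"
copy_b = "GATTACATGC"
genome = "CCCCC" + copy_a + "TTTTT" + copy_b + "AAAAA"

# The fragment comes from copy_a, which starts at position 5.
true_pos = 5
# One sequencing error at index 7 (G read as T) makes the read equal to copy_b.
read = copy_a[:7] + "T" + copy_a[8:]

best_pos = best_location(read, genome)
print(true_pos, best_pos)
```

Here the best location is the second copy, where the read matches perfectly, while the true location has one mismatch. A mapper that solves the best location problem exactly would still report the wrong origin for this read.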

Seeding Heuristics
The Two Types of Seeding Failure
Exact Seeds
Skip Seeds
MEM Seeds
Spaced Seeds
Sequencing Errors and Divergence of the Duplicates
Weighted Generating Functions
Analytic Representation
Example 1
Example 2
Example 3
OFF-TARGET EXACT SEEDS
The Dual Encoding
Transfer Matrix
Illustration
OFF-TARGET SKIP SEEDS
The Skip Dual Encoding
Hard and Soft Masking
The MEM Alphabet
Computing MEM Seeding Probabilities
Monte Carlo Sampling
Main Features of Sesame
Using Sesame to Compare Seeding
Key Insights About MEM Seeds
PRACTICAL CONSIDERATIONS
Realistic Sequencing Errors
Spurious Random Hits
Findings
DISCUSSION