Calibrating Seed-Based Heuristics to Map Short Reads With Sesame.

Guillaume J Filion, Eduard Zorita, Ruggero Cortini

doi:10.3389/fgene.2020.00572

Abstract

The increasing throughput of DNA sequencing technologies creates a need for faster algorithms. The fate of most reads is to be mapped to a reference sequence, typically a genome. Modern mappers rely on heuristics to gain speed at a reasonable cost for accuracy. In the seeding heuristic, short matches between the reads and the genome are used to narrow the search to a set of candidate locations. Several seeding variants used in modern mappers show good empirical performance, but they are difficult to calibrate or to optimize for lack of theoretical results. Here we develop a theory to estimate the probability that the correct location of a read is filtered out during seeding, resulting in mapping errors. We describe the properties of simple exact seeds, skip seeds and MEM seeds (Maximal Exact Match seeds). The main innovation of this work is to use concepts from analytic combinatorics to represent reads as abstract sequences, and to specify their generating function to estimate the probabilities of interest. We provide several algorithms, which together give a workable solution for the problem of calibrating seeding heuristics for short reads. We also provide a C implementation of these algorithms in a library called Sesame. These results can improve current mapping algorithms and lay the foundation of a general strategy to tackle sequence alignment problems. The Sesame library is open source and available for download at https://github.com/gui11aume/sesame.
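The seeding heuristic described in the abstract can be illustrated with a minimal Python sketch. It is not taken from Sesame and does not use its API: the function names `build_index` and `candidate_locations`, the seed length `gamma`, and the toy sequences are assumptions made for illustration, and only exact seeds without indels are considered.

```python
from collections import defaultdict


def build_index(genome, gamma):
    """Map every substring of length gamma to its start positions in the genome."""
    index = defaultdict(list)
    for i in range(len(genome) - gamma + 1):
        index[genome[i:i + gamma]].append(i)
    return index


def candidate_locations(read, index, gamma):
    """Return the candidate start positions of the read implied by exact seeds.

    Each seed hit at genome position `hit`, found at offset `offset` in the
    read, suggests that the read starts at `hit - offset` (assuming no indels).
    The value can be negative when a hit lies near the start of the genome.
    """
    candidates = set()
    for offset in range(len(read) - gamma + 1):
        seed = read[offset:offset + gamma]
        for hit in index.get(seed, ()):
            candidates.add(hit - offset)
    return candidates


# Toy genome; the read below comes from position 3 with one substitution.
genome = "TTGACCATGCAAGT"
index = build_index(genome, 4)
one_error = candidate_locations("ACCATTCA", index, 4)   # seeds survive
two_errors = candidate_locations("ACTATTCA", index, 4)  # every seed is broken
```

In the first call, two seeds survive the sequencing error and point to the true location. In the second call, the errors break every seed, so the true location is filtered out before any alignment takes place; this is the failure whose probability the paper sets out to estimate.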

Highlights

Maximal Exact Match (MEM) seeds yield fewer candidates and speed up the mapping process, but the question is at what cost? The theory developed here will allow us to compute the probability that a read has no MEM seed for the target and that the true location is missed at the seeding stage.
The first is the probability that the read has no seed for the target, which we have already computed in section 3.4 using recursive expression (3).
We have devised a set of methods to compute the probability that seeding heuristics fail and commit the mapping process to an error.
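The probability that a read has no seed for the target can be illustrated in the simplest setting: exact seeds of minimum length `gamma`, a read of `k` nucleotides, and independent sequencing errors at rate `p`. The sketch below uses a plain dynamic programming recurrence over the length of the current error-free run. It is an assumption-laden illustration, not the generating-function recurrence of the paper and not the Sesame API; the name `prob_no_exact_seed` and its parameters are hypothetical.

```python
def prob_no_exact_seed(k, gamma, p):
    """Probability that a read of k nucleotides contains no error-free
    stretch of at least gamma nucleotides, with independent errors at rate p."""
    if gamma > k:
        return 1.0  # the read is too short to contain a seed
    q = 1.0 - p
    # run[j]: probability that no seed has appeared so far and that the
    # current error-free stretch has length j (0 <= j < gamma).
    run = [0.0] * gamma
    run[0] = 1.0
    for _ in range(k):
        nxt = [0.0] * gamma
        nxt[0] = sum(run) * p          # an error resets the stretch
        for j in range(gamma - 1):
            nxt[j + 1] = run[j] * q    # a correct nucleotide extends it
        # The mass run[gamma - 1] * q completes a seed and is discarded.
        run = nxt
    return sum(run)
```

When `k` equals `gamma`, the only way to obtain a seed is an error-free read, so the function returns `1 - (1 - p) ** gamma`. Realistic error models and off-target seeds, which the paper also covers, are outside the scope of this sketch.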

Summary

Mapping Reads to Genomes

Say we use an imperfect instrument to sequence a small fragment of DNA. If we know its genome of origin, how can we find the location of the fragment in this genome? If the sequenced DNA fragment originates from a repetitive sequence, we can map the read only if we can rule out all the candidates except one. The difficulty of the true location problem is due to sequencing errors. The true location of a DNA fragment is not always the best (as measured by the identity between the sequence and the candidate location). This naturally leads to asking how we can identify the best location of the fragment in the genome. We will refer to this as the best location problem. It amounts to finding the optimal alignment between two sequences, and for this reason has received substantial attention in bioinformatics.
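The gap between the true location and the best location can be made concrete with a small sketch. It assumes that identity is measured by the number of mismatches without indels, and it scans every position by brute force; the function name `best_locations` and the toy sequences are hypothetical and serve only as an illustration.

```python
def best_locations(read, genome):
    """Return (fewest mismatches, start positions achieving it),
    comparing the read to every window of the genome without indels."""
    best, where = len(read) + 1, []
    for start in range(len(genome) - len(read) + 1):
        window = genome[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best:
            best, where = mismatches, [start]
        elif mismatches == best:
            where.append(start)
    return best, where


# Two near-duplicates that differ at a single nucleotide (positions 0 and 10).
genome = "ACGTTGCA" + "GG" + "ACGTAGCA"
# The read comes from position 0, but a sequencing error (T -> A at index 4)
# makes it identical to the duplicate at position 10.
read = "ACGTAGCA"
result = best_locations(read, genome)
```

Here the best location is the duplicate at position 10 with no mismatch, whereas the true location is position 0. No mapper can recover from this kind of error by alignment alone, which is why the divergence between duplicates matters.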

Seeding Heuristics

The Two Types of Seeding Failure

Exact Seeds

Skip Seeds

MEM Seeds

Spaced Seeds

Sequencing Errors and Divergence of the Duplicates

Weighted Generating Functions

Analytic Representation

Example 1

Example 2

Example 3

OFF-TARGET EXACT SEEDS

The Dual Encoding

Transfer Matrix

Illustration

OFF-TARGET SKIP SEEDS

The Skip Dual Encoding

Hard and Soft Masking

The MEM Alphabet

Computing MEM Seeding Probabilities

Monte Carlo Sampling

Main Features of Sesame

Using Sesame to Compare Seeding

Key Insights About MEM Seeds

PRACTICAL CONSIDERATIONS

Realistic Sequencing Errors

Spurious Random Hits

Findings

DISCUSSION

Journal: Frontiers in Genetics	Publication Date: Jun 25, 2020
Citations: 2	License type: CC BY 4.0
