Abstract

Repetitive patterns in genomic sequences have great biological significance as well as algorithmic implications. Analytic combinatorics makes it possible to derive formulas for the expected length of repetitions in a random sequence. Asymptotic results, which generalize previous work on a binary alphabet, are easily computable. Simulations on random sequences show their accuracy. As an application, the sample case of Archaea genomes illustrates how biological sequences may differ from random sequences.

Highlights

  • This paper provides combinatorial tools to distinguish biologically significant events from random repetitions in sequences

  • This paper describes the behavior of the number of unique or repeated k-mers in a random sequence, on a general alphabet

  • The derivation relies on a combination of analytic combinatorics and Lagrange multipliers

Summary

Introduction

This paper provides combinatorial tools to distinguish biologically significant events from random repetitions in sequences. This is a key issue in several genomic problems, as many repetitive structures can be found in genomes. Knowing the length of a maximal repeat matters for assembly, notably for the design of algorithms that rely on de Bruijn graphs. In resequencing, aligners commonly assume that any sequenced “read” maps to a single position in a genome: in the ideal case where no sequencing error occurs, this implies that the reads are longer than the maximal repetition. Heuristics have been proposed and implemented (Devillers and Schbath, 2012; Rizk et al., 2013; Chikhi and Medvedev, 2014).
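
To make the link between read length and maximal repetition concrete, the following minimal sketch simulates a random sequence, counts unique and repeated k-mers, and finds the length of the longest repeat by brute force. It is an illustration under stated assumptions, not the method of the paper: letters are drawn independently and uniformly from a four-letter alphabet, and the function names (random_sequence, kmer_profile, longest_repeat_length) are hypothetical.

```python
import random
from collections import Counter


def random_sequence(n, alphabet="ACGT", seed=0):
    """Draw n letters independently and uniformly (an assumed random model)."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(n))


def kmer_profile(seq, k):
    """Return (distinct k-mers seen exactly once, distinct k-mers seen at least twice)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    unique = sum(1 for c in counts.values() if c == 1)
    repeated = sum(1 for c in counts.values() if c >= 2)
    return unique, repeated


def longest_repeat_length(seq):
    """Length of the longest substring occurring at least twice (overlaps allowed).

    A repeated (k+1)-mer always contains a repeated k-mer, so scanning k
    upward and stopping at the first k with no repeat is sufficient.
    """
    k = 0
    while k < len(seq):
        _, repeated = kmer_profile(seq, k + 1)
        if repeated == 0:
            break
        k += 1
    return k


if __name__ == "__main__":
    seq = random_sequence(10_000)
    longest = longest_repeat_length(seq)
    print("longest repeat length:", longest)
    # Any error-free read one letter longer than the longest repeat
    # occurs at a single position in the sequence.
    print("k-mer profile at longest + 1:", kmer_profile(seq, longest + 1))
```

In this idealized, error-free setting, every window of length longest + 1 is unique, which is the condition aligners rely on. The brute-force scan is only meant for short simulated sequences; a suffix-based index would be the usual choice at genome scale.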
