Accurate Prediction of the Statistics of Repetitions in Random Sequences: A Case Study in Archaea Genomes.

Mireille Régnier, Philippe Chassignet

doi:10.3389/fbioe.2016.00035

Abstract

Repetitive patterns in genomic sequences have great biological significance and also algorithmic implications. Analytic combinatorics allows one to derive formulas for the expected length of repetitions in a random sequence. Asymptotic results, which generalize previous work on a binary alphabet, are easily computable. Simulations on random sequences show their accuracy. As an application, the sample case of Archaea genomes illustrates how biological sequences may differ from random sequences.

Highlights

This paper provides combinatorial tools to distinguish biologically significant events from random repetitions in sequences.
This paper describes the behavior of the number of unique or repeated k-mers in a random sequence over a general alphabet.
The derivation relies on a combination of analytic combinatorics and Lagrange multipliers.

Summary

Introduction

This paper provides combinatorial tools to distinguish biologically significant events from random repetitions in sequences. This is a key issue in several genomic problems, as many repetitive structures can be found in genomes. Knowledge of the length of a maximal repeat is essential for assembly, notably for the design of algorithms that rely on de Bruijn graphs. In resequencing, aligners commonly assume that any sequenced “read” maps to a single position in a genome: in the ideal case where no sequencing error occurs, this implies that the reads are longer than the maximal repetition. Heuristics have been proposed and implemented (Devillers and Schbath, 2012; Rizk et al., 2013; Chikhi and Medvedev, 2014).
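The quantities discussed above can be illustrated with a minimal simulation sketch. It is not the analytic method of the paper: it simply draws a random sequence and counts k-mers by brute force. The function names (`random_sequence`, `kmer_profile`, `longest_repeat_length`), the default DNA alphabet, and the independent-letter model are assumptions made for illustration.

```python
import random
from collections import Counter


def random_sequence(n, alphabet="ACGT", weights=None, seed=0):
    """Draw n independent letters; weights=None means a uniform model (assumption)."""
    rng = random.Random(seed)
    return "".join(rng.choices(alphabet, weights=weights, k=n))


def kmer_profile(seq, k):
    """Return (unique, repeated): distinct k-mers seen once vs. at least twice."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    unique = sum(1 for c in counts.values() if c == 1)
    return unique, len(counts) - unique


def longest_repeat_length(seq):
    """Length of the longest substring occurring at least twice (overlaps allowed)."""
    # If some k-mer is repeated, its (k-1)-mer prefix is repeated too,
    # so "a repeated k-mer exists" is monotone in k and binary search applies.
    lo, hi = 0, len(seq) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if kmer_profile(seq, mid)[1] > 0:
            lo = mid
        else:
            hi = mid - 1
    return lo


if __name__ == "__main__":
    seq = random_sequence(10_000)
    print("longest repeat:", longest_repeat_length(seq))
    print("unique / repeated 8-mers:", kmer_profile(seq, 8))
```

In the error-free setting described above, any read strictly longer than the value returned by `longest_repeat_length` occurs at a single position in the sequence. This brute-force count is only practical for short sequences; it is meant as a check against which predicted values could be compared.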

Methods

Results

Conclusion