A genome signature derived from the interplay of word frequencies and symbol correlations

Simon Möller,Heike Hameister,Marc-Thorsten Hütt

doi:10.1016/j.physa.2014.07.048

Abstract

Genome signatures are statistical properties of DNA sequences that provide information on the underlying species. It is not understood, how such species-discriminating statistical properties arise from processes of genome evolution and from functional properties of the DNA. Investigating the interplay of different genome signatures can contribute to this understanding. Here we analyze the statistical dependences of two such genome signatures: word frequencies and symbol correlations at short and intermediate distances.We formulate a statistical model of word frequencies in DNA sequences based on the observed symbol correlations and show that deviations of word counts from this correlation-based null model serve as a new genome signature. This signature (i) performs better in sorting DNA sequence segments according to their species origin and (ii) reveals unexpected species differences in the composition of microsatellites, an important class of repetitive DNA.While the first observation is a typical task in metagenomics projects and therefore an important benchmark for a genome signature, the latter suggests strong species differences in the biological mechanisms of genome evolution.On a more general level, our results highlight that the choice of null model (here: word abundances computed via symbol correlations rather than shorter word counts) substantially affects the interpretation of such statistical signals.

Full Text