Abstract

Genome signatures are statistical properties of DNA sequences that provide information on the underlying species. It is not understood how such species-discriminating statistical properties arise from processes of genome evolution and from functional properties of the DNA. Investigating the interplay of different genome signatures can contribute to this understanding. Here we analyze the statistical dependences of two such genome signatures: word frequencies and symbol correlations at short and intermediate distances. We formulate a statistical model of word frequencies in DNA sequences based on the observed symbol correlations and show that deviations of word counts from this correlation-based null model serve as a new genome signature. This signature (i) performs better in sorting DNA sequence segments according to their species origin and (ii) reveals unexpected species differences in the composition of microsatellites, an important class of repetitive DNA. While the first is a typical task in metagenomics projects and therefore an important benchmark for a genome signature, the latter suggests strong species differences in the biological mechanisms of genome evolution. On a more general level, our results highlight that the choice of null model (here: word abundances computed via symbol correlations rather than shorter word counts) substantially affects the interpretation of such statistical signals.
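
To make the two signatures named above concrete, the following is a minimal Python sketch, not the authors' implementation. It counts overlapping words, estimates a symbol correlation at distance d using the common definition C_ab(d) = P_ab(d) - p_a p_b (an assumption; the abstract does not give the exact form), and computes the conventional null model based on shorter word counts that the abstract contrasts with. The correlation-based null model itself is not specified in the abstract and is therefore not reproduced here. All function names are hypothetical.

```python
from collections import Counter


def word_counts(seq, k):
    """Count overlapping words (k-mers) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def symbol_correlation(seq, a, b, d):
    """Estimate C_ab(d) = P(a at i, b at i+d) - P(a) * P(b).

    Assumed definition; the abstract does not state the exact estimator.
    """
    n = len(seq)
    pairs = n - d
    if pairs <= 0:
        raise ValueError("distance d must be smaller than the sequence length")
    joint = sum(1 for i in range(pairs) if seq[i] == a and seq[i + d] == b) / pairs
    return joint - (seq.count(a) / n) * (seq.count(b) / n)


def shorter_word_expected(seq, word):
    """Expected count of word from shorter word counts (Markov-type null model):

        E[N(w1..wk)] = N(w1..wk-1) * N(w2..wk) / N(w2..wk-1)
    """
    k = len(word)
    if k < 3:
        raise ValueError("word must have length >= 3")
    shorter = word_counts(seq, k - 1)
    core = word_counts(seq, k - 2)[word[1:-1]]
    if core == 0:
        return 0.0
    return shorter[word[:-1]] * shorter[word[1:]] / core


if __name__ == "__main__":
    seq = "ACGTACGTACGT"  # toy sequence, for illustration only
    observed = word_counts(seq, 3)["ACG"]
    expected = shorter_word_expected(seq, "ACG")
    print("observed:", observed, "expected:", expected)
    print("C_AA(4):", symbol_correlation(seq, "A", "A", 4))
```

A signature in the spirit of the abstract would collect the deviations of observed from expected counts over all words; swapping the shorter-word expectation for one derived from symbol correlations is the step that the paper's model supplies.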

