Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics.

S Havlin, C.-K Peng, R N Mantegna, A L Goldberger, S V Buldyrev, H E Stanley, M Simons

doi:10.1103/physreve.52.2939

Abstract

We compare the statistical properties of coding and noncoding regions in eukaryotic and viral DNA sequences by adapting two tests developed for the analysis of natural languages and symbolic sequences. The data set comprises all 30 sequences of length above 50 000 base pairs in GenBank Release No. 81.0, as well as the recently published sequences of C. elegans chromosome III (2.2 Mbp) and yeast chromosome XI (661 kbp). We find that, for the three chromosomes we studied, the statistical properties of noncoding regions appear to be closer to those observed in natural languages than those of coding regions. In particular, (i) an n-tuple Zipf analysis of noncoding regions reveals a regime close to power-law behavior, whereas the coding regions show logarithmic behavior over a wide interval, and (ii) an n-gram entropy measurement shows that the noncoding regions have a lower n-gram entropy (and hence a larger "n-gram redundancy") than the coding regions. In contrast to the three chromosomes, we find that for vertebrates such as primates and rodents and for viral DNA, the difference between the statistical properties of coding and noncoding regions is not pronounced, and therefore the results of the analyses of the investigated sequences are less conclusive. After noting the intrinsic limitations of the n-gram redundancy analysis, we also briefly discuss the failure of the zeroth- and first-order Markovian models or simple nucleotide repeats to account fully for these "linguistic" features of DNA. Finally, we emphasize that our results by no means prove the existence of a "language" in noncoding DNA.
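The two tests named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes overlapping n-tuples read with a sliding window of step one, a Shannon entropy in bits, and a redundancy normalized as 1 - H(n) / (n log2 4); the function names (`ntuple_counts`, `zipf_ranking`, `zipf_slope`, `ngram_entropy`, `ngram_redundancy`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def ntuple_counts(seq, n):
    """Count overlapping n-tuples (assumed: sliding window, step 1)."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def zipf_ranking(counts):
    """Return (rank, relative frequency) pairs, most frequent first."""
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    return [(rank, c / total) for rank, c in enumerate(ordered, start=1)]


def zipf_slope(ranking):
    """Least-squares slope of log10(frequency) against log10(rank).

    A roughly constant slope over a range of ranks is what a
    power-law (Zipf-like) regime would look like on a log-log plot.
    """
    xs = [math.log10(r) for r, _ in ranking]
    ys = [math.log10(f) for _, f in ranking]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def ngram_entropy(counts):
    """Shannon entropy (bits) of the observed n-tuple distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngram_redundancy(counts, n, alphabet_size=4):
    """Assumed normalization: 1 - H(n) / (n * log2(alphabet_size))."""
    return 1.0 - ngram_entropy(counts) / (n * math.log2(alphabet_size))
```

A typical use would be to call `ntuple_counts(region, n)` separately on coding and noncoding regions and compare the resulting slopes and redundancies. For short sequences the n-tuple counts are sparse, so entropy estimates of this kind are biased downward; this sketch makes no correction for that.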

Journal: Physical review. E, Statistical physics, plasmas, fluids, and related interdisciplinary topics
Publication Date: Sep 1, 1995
Citations: 110

Similar Papers

Time-dependent ARMA modeling of genomic sequences
Jerzy S Zielinski ... Dan Schonfeld
BMC Bioinformatics | VOL. 9 | 01 Aug 2008

Clustering of Identical Oligomers in Coding and Noncoding DNA Sequences
H Eugene Stanley ... Nikolay V Dokholyan
Journal of Biomolecular Structure and Dynamics | VOL. 17 | 01 Aug 1999

Expansion of tandem repeats and oligomer clustering in coding and noncoding DNA sequences
H. Eugene Stanley ... Rachel H. R. Stanley
Physica A: Statistical Mechanics and its Applications | VOL. 273 | 01 Nov 1999

C.U.R.R.F. (Codon Usage regarding Restriction Finder): A Free Java®-Based Tool to Detect Potential Restriction Sites in Both Coding and Non-Coding DNA Sequences
Michael Gatter ... Falk Matthäus
Molecular Biotechnology | VOL. 52 | 13 Dec 2011
