Statistical scales of order in DNA

Douglas Poland

doi:10.1016/j.bpc.2009.02.003

Abstract

In the present paper we examine the statistics of occurrence of A–T and C–G base pairs in DNA. We focus on the net base composition in blocks of base pairs of various sizes. This paper extends our previous work on randomness and order in DNA sequences and examines order on various scales. For structure on the local scale (10^0–10^1 bp) we have seen that the net base composition in given block sizes is fitted very accurately by the discrete binomial distribution for a random system. If the statistics were random for larger block sizes, then the appropriate distribution would be the standard normal (Gaussian) distribution, which is the continuous analog of the discrete binomial distribution. However, we have found that at the intermediate scale (10^2–10^4 bp) the composition distribution is not fitted by a standard normal distribution but rather by a modified normal distribution with a standard deviation that is a marked nonrandom function of block size. In particular, the standard deviation accurately follows a power law with a characteristic exponent. This behavior can be interpreted in terms of a random walk model due to Mandelbrot that is characterized by a tendency for the walk to persist in direction. The DNA analog of the walk model is the tendency of blocks of base pairs with a given net composition to be followed by blocks of a similar composition (persistence of composition). A model based on a generating function constructed from a matrix of conditional probabilities (incorporating persistence) explains the overall order in a given genome at the intermediate scale. In the present paper we examine the block statistics in DNA using the genomes of two organisms, namely Bacillus anthracis and Escherichia coli, both of which have a chain length of slightly over five million base pairs. We find that the distributions in B. anthracis are well fitted by a Mandelbrot-like distribution. On the other hand, the distributions in E. coli are not so well fitted by this distribution, which is based on two moments. Using the maximum-entropy method we construct an improved distribution for E. coli based on four moments. Finally, we look at the order on the scale of the entire molecule (global scale). Applying the model of a random walk to the complete DNA genome, we find that the Mandelbrot distribution on an intermediate level cannot explain the global character of the random walk, there being structure to the walk with features on the scale of the total length of the molecule (10^5–10^7 bp). To understand the three scales of order (local, intermediate and global) we construct a model sequence based on the incorporation of Mandelbrot-type order on the intermediate scale in a single size block. We then find that the character of the order on the local and global scales follows naturally from this single feature. Thus all three scales of order in DNA are incorporated into our model sequence.
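
The block-size dependence of the standard deviation described above can be illustrated with a minimal sketch. It is not the analysis code of the paper. It assumes a hypothetical encoding in which each A or T contributes +1 and each C or G contributes -1 to the net composition of a block, and it uses a synthetic random sequence in place of a real genome. The function names block_net_composition and fit_power_law_exponent are introduced here for illustration only. For a purely random sequence the fitted exponent should lie near 0.5; a walk with persistence in the sense of Mandelbrot would be expected to give a larger value.

```python
import math
import random
import statistics


def block_net_composition(seq, block_size):
    """Net composition of consecutive non-overlapping blocks.

    Assumed encoding (not taken from the paper): A/T -> +1, C/G -> -1.
    """
    steps = [1 if base in "AT" else -1 for base in seq]
    n_blocks = len(steps) // block_size
    return [
        sum(steps[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]


def fit_power_law_exponent(seq, block_sizes):
    """Least-squares slope of log(std dev) against log(block size)."""
    xs = [math.log(n) for n in block_sizes]
    ys = [
        math.log(statistics.pstdev(block_net_composition(seq, n)))
        for n in block_sizes
    ]
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(ys)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


# Synthetic stand-in for a genome: independent, equally likely bases.
rng = random.Random(0)
random_seq = "".join(rng.choice("ACGT") for _ in range(200_000))

block_sizes = [10, 20, 50, 100, 200, 500]
exponent = fit_power_law_exponent(random_seq, block_sizes)
print(f"fitted exponent for a random sequence: {exponent:.3f}")
```

Replacing random_seq with a real genome string read from a sequence file would give the corresponding exponent for that organism, subject to the same assumed encoding.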
