Average mutual information of coding and noncoding DNA.

Ivo Grosse, Dirk Holste, H Eugene Stanley, Hanspeter Herzel, Sergey V Buldyrev

doi:10.1142/9789814447331_0059

Abstract

One basic problem in the analysis of DNA sequences is the recognition of protein-coding genes. Computer algorithms that facilitate gene identification have become important as genome projects have turned from mapping to large-scale sequencing, producing an exponentially growing number of sequenced nucleotides that await annotation. Many statistical patterns that differ between coding and noncoding DNA have been discovered, but most of them vary from species to species and hence require prior training on organism-specific data sets. Here, we investigate whether there exist species-independent statistical patterns that differ between coding and noncoding DNA. We introduce an information-theoretic quantity, the average mutual information (AMI), and we find that the probability distribution functions of the AMI are significantly different in coding and noncoding DNA, while they are almost identical for different species. This finding suggests that the AMI might be useful for recognizing protein-coding regions in genomes for which training sets do not exist.
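
The abstract names the AMI but does not define it. As a rough illustration only, the Python sketch below estimates the standard mutual information I(k) between nucleotides separated by k positions, and then averages I(k) over k = 1..k_max. The function names, the `k_max` parameter, and the choice to average over distances are assumptions made for this example; they are not necessarily the definition used in the paper.

```python
from collections import Counter
from math import log2


def mutual_information(seq: str, k: int) -> float:
    """Estimate the mutual information (in bits) between nucleotides
    that are k positions apart, using plain frequency counts."""
    seq = seq.upper()
    # Collect all pairs (seq[i], seq[i + k]); skip ambiguous symbols such as N.
    pairs = [
        (seq[i], seq[i + k])
        for i in range(len(seq) - k)
        if seq[i] in "ACGT" and seq[i + k] in "ACGT"
    ]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                 # counts of (a, b)
    left = Counter(a for a, _ in pairs)    # marginal counts of the first symbol
    right = Counter(b for _, b in pairs)   # marginal counts of the second symbol
    mi = 0.0
    for (a, b), c in joint.items():
        # p_ab / (p_a * p_b) simplifies to c * n / (left[a] * right[b]).
        mi += (c / n) * log2(c * n / (left[a] * right[b]))
    return mi


def average_mutual_information(seq: str, k_max: int = 18) -> float:
    """Hypothetical averaging scheme (an assumption of this sketch):
    the mean of I(k) over k = 1..k_max."""
    return sum(mutual_information(seq, k) for k in range(1, k_max + 1)) / k_max
```

Applied to many sequence windows, a function of this kind yields one value per window, and the histograms of those values for coding and noncoding windows could then be compared. Frequency-count estimates of mutual information are biased upward for short sequences, so a practical analysis would likely need a finite-size correction.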
