Abstract

The diversity of protein sequences that exists today has probably evolved from antecedent, evolutionarily conserved domain-like sequences (i.e. motifs, repeats, structural domains) encoded by short ancient genes. We have studied the statistical distributions of the occurrences of domain-like families within proteins in the proteomes. A generalized hypergeometric stochastic process is introduced to model the evolutionary dynamics of these conserved sequences. We found that the limiting probability function associated with this process fits the empirical distributions for 90 fully sequenced bacterial, archaeal and eukaryotic organisms. For eukaryotes, our limiting distribution reduces to Waring's distribution. For many archaeal and bacterial organisms, however, the empirical distributions degenerate to a Yule-like distribution. Comparison of these distributions implies critical evolutionary events that lead to proportional growth of the number of new protein-coding genes and of proteome complexity in eukaryotic organisms, and suggests that the evolution of many archaeal and bacterial organisms is subject to external global (ecological) forces. Best-fit model data predict that (1) there are only approximately 5500 distinct InterPro domains in a given higher eukaryotic organism and that (2) a general trend in eukaryotic proteome evolution is an increase in the frequency of multi-domain proteins composed of already-existing (older) distinct domains, as opposed to the creation of new ones. Our model may be applicable to the analysis of the evolution of word distributions in texts and may be used for other large-scale evolving systems such as the Internet, the economy and the universe.
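
To make the relationship between the two named distributions concrete, the following is a minimal sketch of Waring's distribution and the Yule (Yule-Simon) distribution in their standard textbook forms. The parameter names `alpha` and `rho` and the function names are assumptions chosen for illustration; the parametrization used in the paper itself, and its more general hypergeometric limiting distribution, are not reproduced here. In this form, setting `alpha = 1` reduces Waring's distribution to the Yule distribution, shifted by one in its support.

```python
from math import exp, lgamma, log


def waring_pmf(k, alpha, rho):
    """Waring pmf on k = 0, 1, 2, ... (standard form, assumed here):
    P(X = k) = rho * (alpha)_k / (alpha + rho)_(k+1),
    where (a)_n is the rising factorial. Requires alpha > 0, rho > 0.
    """
    # Work in log space so that large k does not overflow the gamma function.
    log_p = (log(rho) + lgamma(alpha + rho) + lgamma(alpha + k)
             - lgamma(alpha) - lgamma(alpha + rho + k + 1))
    return exp(log_p)


def yule_pmf(k, rho):
    """Yule-Simon pmf on k = 1, 2, ...:
    P(X = k) = rho * B(k, rho + 1), with B the beta function.
    """
    log_p = log(rho) + lgamma(k) + lgamma(rho + 1) - lgamma(k + rho + 1)
    return exp(log_p)


if __name__ == "__main__":
    # Illustrative parameter values only; these are not fitted values from the paper.
    for k in range(5):
        print(k, waring_pmf(k, alpha=1.0, rho=2.0), yule_pmf(k + 1, rho=2.0))
```

Both distributions have a power-law tail, with probabilities decaying roughly as k^-(rho + 1) for large k, which is why they are natural candidates for skewed occurrence counts such as the number of times a domain family appears in a proteome.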
