Abstract

An analytical model based on the statistical properties of Open Reading Frames (ORFs) of eubacterial genomes such as codon composition and sequence length of all reading frames was developed. This new model predicts the average length, maximum length as well as the length distribution of the ORFs of 70 species with GC contents varying between 21% and 74%. Furthermore, the number of annotated genes is predicted with good agreement. However, the ORF length distribution in the five alternative reading frames shows interesting deviations from the predicted distribution. In particular, long ORFs appear more often than expected statistically. The unexpected depletion of stop codons in these alternative open reading frames cannot be completely explained by a biased codon usage in the +1 frame. While it is unknown if the stop codon depletion has a biological function, it could be due to a protein coding capacity of alternative ORFs exerting a selection pressure which prevents the fixation of stop codon mutations. The comparison of the analytical model with bacterial genomes, therefore, leads to a hypothesis suggesting novel gene candidates which can now be investigated in subsequent wet lab experiments.

Highlights

  • The physical basis for heredity is the DNA double helix

  • Although the number and the typical length of Open Reading Frames (ORFs) may vary, bacteria share common characteristics in their ORF length distribution, which is correlated with their GC-content

  • An open reading frame is defined as the region between a start codon NTG, with N ∈ {A, G, C, T}, followed by a number n of triplets (n ≥ 0) and concluded with one of the three possible stop codons (TAG, TGA, TAA)
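The ORF definition in the last highlight can be illustrated with a minimal sketch. The function names `is_start`, `find_orfs` and `reverse_complement` are hypothetical and not taken from the paper. The sketch assumes that each stop codon closes the ORF opened by the first NTG codon seen since the previous stop, which is only one of several possible conventions.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement, used to scan the opposite strand."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_start(codon):
    """True for a start codon NTG with N in {A, G, C, T}."""
    return len(codon) == 3 and codon[0] in "ACGT" and codon[1:] == "TG"


def find_orfs(seq, frame=0):
    """Scan one forward reading frame (offset 0, 1 or 2) for ORFs.

    Returns a list of (start, end, n) tuples, where start is the index of
    the first base of the start codon, end is the index just past the stop
    codon, and n >= 0 is the number of triplets between start and stop.
    Assumption: the first NTG after the previous stop opens the ORF.
    """
    seq = seq.upper()
    orfs = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None:
            if is_start(codon):
                start = i
        elif codon in STOP_CODONS:
            orfs.append((start, i + 3, (i - start) // 3 - 1))
            start = None
    return orfs
```

Calling `find_orfs(seq, frame)` for offsets 0, 1 and 2 on both `seq` and `reverse_complement(seq)` covers all six reading frames; coordinates on the opposite strand then refer to the reverse-complemented sequence.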


Summary

Introduction

The physical basis for heredity is the DNA double helix. Proteins are encoded in Open Reading Frames (ORFs) delimited by a start and a stop codon. The same tendency holds for a genome with a high GC-content of 75.9%: in this case, 75% of all ORFs are shorter than 195 codons and a minority of 0.1% are longer than 1854 codons (own data). It is well known that the distribution of the overall ORF lengths correlates with the GC-content of a genome, because stop codons are AT-rich. Oliver et al. [2] calculated a theoretical stop codon probability depending on the GC-content, and the expected distribution of ORF lengths in a random model of independent and identically distributed (IID) nucleotides. For the latter, they found that the probability of observing an ORF comprising more than 200 codons is rather small, even when the GC-content is varied from 30% to 70%. Since most parts of bacterial genomes are covered by genes, the general statistical behavior of bacterial genomes is expected to be determined by the codon usage.
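To make the IID argument concrete, the following minimal sketch computes a stop codon probability from the GC-content and the resulting chance of a long ORF. It is not the formula of Oliver et al. [2] as published; it assumes independent nucleotides with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 - GC)/2, so that ORF lengths follow a geometric distribution. The function names are hypothetical.

```python
def stop_codon_probability(gc):
    """Probability that a random triplet is TAA, TAG or TGA.

    Assumes IID nucleotides with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2 (an assumption of this sketch).
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    a = t = (1.0 - gc) / 2.0
    g = gc / 2.0
    # P(TAA) + P(TAG) + P(TGA)
    return t * a * a + t * a * g + t * g * a


def prob_orf_longer_than(n_codons, gc):
    """Probability that n_codons consecutive triplets contain no stop."""
    return (1.0 - stop_codon_probability(gc)) ** n_codons


def expected_orf_length(gc):
    """Mean number of triplets up to and including the first stop codon."""
    return 1.0 / stop_codon_probability(gc)
```

Under these assumptions, a GC-content of 50% gives a stop codon probability of 3/64, and `prob_orf_longer_than(200, gc)` stays small for GC-contents between 30% and 70%. A higher GC-content lowers the stop codon probability and therefore lengthens the expected ORF, in line with the correlation described above.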

