Abstract

The amino acid content of the proteins encoded by a genome may predict the coding potential of that genome and may reflect lifestyle restrictions of the organism. Here, we calculated the Kullback–Leibler divergence from the mean amino acid content as a metric to compare the amino acid composition of a large set of bacterial and phage genome sequences. Using these data, we demonstrate that (i) amino acid utilization differs significantly among phylogenetic groups of bacteria and phages; (ii) many of the bacteria with the most skewed amino acid utilization profiles, or the bacteria that host phages with the most skewed profiles, are endosymbionts or parasites; (iii) the skews in the distribution are not restricted to certain metabolic processes but are common across all bacterial genomic subsystems; and (iv) amino acid utilization profiles correlate strongly with GC content in bacterial genomes but only very weakly with GC content in phage genomes. These findings might be exploited to distinguish coding from non-coding sequences in large data sets, such as metagenomic sequence libraries, to help prioritize subsequent analyses.

Highlights

  • We demonstrate that Kullback–Leibler divergence (KLD) correlates well with an organism’s phylogeny and amino acid utilization profile, in addition to correlating with the GC content of bacterial genomes

  • KLD was calculated for all predicted proteins encoded by 372 bacterial genomes and 835 phage genomes

Introduction

The central dogma of molecular biology describes the irreversible flow of information in biological systems from nucleic acids to amino acids, whose combinations make up the main cellular components: proteins. In principle, this flow of information is no different from that in other data storage and communication systems, and it can be studied using information theory (Shannon, 1948). Shannon’s index is increasingly being used as a bioinformatics tool to solve problems related to either network or genome context, e.g., comparative genomics, resolution-free metrics, motif classification, and sequence-independent correlations (De Domenico & Biamonte, 2016; Vinga, 2014). Von Neumann entropy, the quantum analogue of Shannon’s classical entropy, has been used as a divergence measure in applications ranging from spectral data to human microbiome networks (De Domenico & Biamonte, 2016).
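As an illustration of the metric named in the abstract, the following is a minimal sketch of computing the Kullback–Leibler divergence of a genome's amino acid composition from a mean composition. It is not the authors' implementation. The function names, the use of log base 2, the small pseudocount, and the unweighted mean across genomes are all assumptions made for this example, and the sequences are invented toy data.

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(proteins, pseudocount=1e-9):
    """Return the relative amino acid frequencies over a list of protein sequences.

    The pseudocount is an assumption of this sketch; it keeps every
    frequency strictly positive so that the logarithm below is defined.
    """
    counts = Counter()
    for seq in proteins:
        # Ignore ambiguous or non-standard residue codes such as X or *.
        counts.update(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    raw = {aa: counts[aa] / total + pseudocount for aa in AMINO_ACIDS}
    norm = sum(raw.values())
    return {aa: value / norm for aa, value in raw.items()}


def mean_profile(profiles):
    """Unweighted mean of several frequency profiles (an assumed choice)."""
    n = len(profiles)
    return {aa: sum(p[aa] for p in profiles) / n for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS)


if __name__ == "__main__":
    # Hypothetical toy "genomes", each a list of protein sequences.
    genomes = {
        "balanced": ["ACDEFGHIKLMNPQRSTVWY"],
        "skewed": ["AAAAAAAAAK"],
    }
    profiles = {name: amino_acid_frequencies(seqs) for name, seqs in genomes.items()}
    mean = mean_profile(list(profiles.values()))
    for name, profile in profiles.items():
        print(name, round(kl_divergence(profile, mean), 3))
```

A larger value indicates a composition further from the mean. In a real analysis the mean profile would be computed over the full genome set rather than two toy entries.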

