Abstract

Bacterial genomes exhibit a wide range of compositional diversity, seen most strikingly in genomic GC content, which ranges across organisms from as low as 17% to as high as 75%. The nature of the biological processes underlying these differences has long been debated, and two opposing interpretations have been advanced: one proposes that GC content is driven by genome-specific mutational biases (the mutational hypothesis), the other that it reflects different selective processes in different organisms (the selectionist hypothesis). The hypothesis that differences in GC content are mostly driven by species-specific mutational biases [1] implies that the smallest variation in GC content across genomes should be seen at positions that are most constrained by any form of purifying selection, and conversely that the greatest variation should be observed at positions that are functionally neutral. Differences in GC content among prokaryotic genomes largely reflect, or are driven by, the GC content of protein-coding sequences, which usually occupy the majority of the genome. When the GC content at the three codon positions of genes (GC1, GC2, GC3) is considered separately, typical patterns are observed (Figure 1, panel A). The GC content of all three positions varies roughly linearly with the overall GC content of the genes, but variation in the first two codon positions, and especially in the second codon position, is much smaller than the variation observed in third codon positions, where the GC content spans, across species, almost all possible values, from close to GC3 = 0.0 to almost GC3 = 1.0.
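As an illustration of the quantities GC1, GC2 and GC3 discussed above, the following minimal sketch computes the GC fraction at each codon position of a single coding sequence. The function name `gc_by_codon_position` and the example sequence are hypothetical and not taken from the paper; the sketch assumes the sequence is given in frame, starting at the first base of the first codon.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3) for a coding sequence read in frame from its first base."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore any trailing partial codon
    fractions = []
    for pos in range(3):
        bases = seq[pos:usable:3]  # every third base, starting at this codon position
        known = [b for b in bases if b in "ACGT"]  # skip ambiguous symbols such as N
        gc_count = sum(b in "GC" for b in known)
        fractions.append(gc_count / len(known) if known else float("nan"))
    return tuple(fractions)


# Hypothetical example: codons ATG GCC AAA GGC TAA
example = "ATGGCCAAAGGCTAA"
gc1, gc2, gc3 = gc_by_codon_position(example)
print(f"GC1={gc1:.2f} GC2={gc2:.2f} GC3={gc3:.2f}")
```

For a genome-level comparison of the kind described in the text, the counts would presumably be pooled over all annotated genes of a genome before taking the ratio, rather than averaged gene by gene; that choice is an assumption here.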
These differences in variability are consistent with the constraints expected from the relation between codons and amino acids [2]: the first and second codon positions mostly determine the amino acid type (with the second codon position mostly determining the physico-chemical properties of the amino acid), whereas changes at the third codon position are mostly either synonymous or lead to amino acids with similar properties. Interestingly, the GC content of intergenic regions correlates closely with the GC content of the coding sequences (Figure 1, panel B), and it varies across genomes to approximately the same extent as the GC content of coding regions, and thus much less than that of third codon positions. A simple toy model relating mutational bias to codon compositional substitutability can be proposed to explain the overall contrasts and variability in GC content observed between codon positions of different genomes. In this model, coding regions are represented as sequences over a two-letter alphabet {S, W}, in which bases are identified either as Strong (S = G or C) or as Weak (W = A or T). Each sequence position is assumed either to evolve freely by substitution between the S and W states, or to be constrained by purifying selection to state S or to state W. Each of the three codon positions i (i = 1, 2, or 3) will then be characterized by a codon-position-specific fraction $f_s^{(i)}$ of sites constrained to be of type S, a fraction $f_w^{(i)}$
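The two-state model introduced above can be illustrated numerically. The sketch below is not the authors' formulation, which is cut off in this excerpt. It assumes that the sites of codon position i not constrained to S or W evolve freely and sit at a genome-wide mutational equilibrium GC fraction `mu`, so that the expected GC content is f_s + (1 - f_s - f_w) * mu. The function name, the parameter `mu`, and all numerical fractions are hypothetical values chosen for illustration only.

```python
def expected_gc(f_s, f_w, mu):
    """Expected GC of one codon position under the assumed two-state model.

    f_s: fraction of sites constrained to S (G or C)
    f_w: fraction of sites constrained to W (A or T)
    mu:  assumed equilibrium GC fraction of freely evolving sites
    """
    if f_s < 0 or f_w < 0 or f_s + f_w > 1:
        raise ValueError("f_s and f_w must be non-negative and sum to at most 1")
    if not 0 <= mu <= 1:
        raise ValueError("mu must lie in [0, 1]")
    free = 1.0 - f_s - f_w  # fraction of sites free to follow the mutational bias
    return f_s + free * mu


# Hypothetical constraint fractions (f_s, f_w) per codon position; not from the paper.
constraints = {1: (0.35, 0.25), 2: (0.30, 0.45), 3: (0.05, 0.05)}

for mu in (0.2, 0.5, 0.8):
    row = ", ".join(
        f"GC{i}={expected_gc(fs, fw, mu):.3f}" for i, (fs, fw) in constraints.items()
    )
    print(f"mu={mu:.1f}: {row}")

# Span of GC values reachable as mu goes from 0 to 1: equal to the free fraction.
gc_range = {
    i: expected_gc(fs, fw, 1.0) - expected_gc(fs, fw, 0.0)
    for i, (fs, fw) in constraints.items()
}
```

Under these assumptions each position's GC content is linear in `mu`, with a slope equal to its fraction of free sites, so a weakly constrained third position spans almost the full range while a strongly constrained second position varies little.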
