Abstract

Guanine+cytosine (GC) content varies markedly along the chromosomes of homeotherms, and great effort has been devoted to studying this heterogeneity and its biological implications. Even before the DNA-sequencing era, however, it was established that the dinucleotides in the DNA of mammals in particular, and of most organisms in general, show striking over- and under-representations that cannot be explained by base composition. Here we show that in the coding regions of vertebrates both GC content and codon occurrences are strongly correlated with such “motif preferences”, even though we quantify the latter using an index that is not affected by base composition, codon usage, or protein-sequence encoding. These correlations are likely to result from the long-term shaping of the primary structure of genic and non-genic DNA by a mutation regime whose central features have been maintained by natural selection. Indeed, we find that these preferences are conserved in vertebrates even more rigidly than codon occurrences, and we show that the occurrence-preference correlations are stronger in intronic and non-genic DNA, with R² values reaching 99% when GC content is ∼0.5. The mutation regime appears to be characterized by rates that depend markedly on the bases immediately preceding and following each mutating site: when we estimate such rates of neighbor-base-dependent mutation (NBDM) from substitutions retrieved from alignments of coding, intronic, and non-genic mammalian DNA sorted and grouped by GC content, they suffice to simulate DNA sequences in which motif occurrences and preferences, as well as the correlations of motif preferences with GC content and with motif occurrences, are very similar to the mammalian ones.
The best fit, however, is obtained with NBDM regimes lacking strand effects, which indicates that over the long term NBDM switches strands in the germline, as one would expect for effects due to loosely contained background transcription. Finally, we show that human coding regions are less mutable under the estimated NBDM regimes than under matched context-independent mutation, and that this entails marked differences between the spectra of amino-acid mutations that the two mutation regimes should generate. In the Discussion we examine the mechanisms likely to underlie NBDM heterogeneity along chromosomes and propose that it reflects how the diversity and activity of lesion-bypass polymerases (LBPs) track the landscapes of scheduled and non-scheduled genome repair, replication, and transcription during the cell cycle. We conclude that the primary structure of vertebrate genic DNA at and below the trinucleotide level has been governed over the long term by highly conserved regimes of NBDM, which should be under direct natural selection because they drastically alter missense-mutation rates and hence the somatic and germline mutational loads. Therefore, the non-coding DNA of vertebrates may have been shaped by NBDM only epiphenomenally, with non-genic DNA being affected mainly when found in the proximity of genes.
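
To make the idea of neighbor-base-dependent mutation concrete, the following is a minimal sketch, not the authors' procedure. It evolves a circular DNA sequence under a rate table indexed by the (preceding, mutating, following) bases, then measures how far one dinucleotide departs from its base-composition expectation. The rate values are invented for illustration (a strand-symmetric boost for C→T and G→A at CG sites); the abstract does not report the estimated rates. The function names (`make_hypothetical_rates`, `simulate_nbdm`, `dinucleotide_odds_ratio`) are hypothetical. The odds ratio used here is the classic observed/expected dinucleotide ratio, which is not the composition-independent index the abstract refers to.

```python
import random
from collections import Counter

BASES = "ACGT"

def make_hypothetical_rates(base_rate=1.0, cpg_boost=10.0):
    """Build an illustrative NBDM table: rates[(left, mid, right)][new_base].

    All values are assumptions made for this sketch. The only context effect
    is a strand-symmetric boost of C->T and G->A at CG sites.
    """
    rates = {}
    for left in BASES:
        for mid in BASES:
            for right in BASES:
                row = {b: base_rate for b in BASES if b != mid}
                if mid == "C" and right == "G":
                    row["T"] *= cpg_boost
                if mid == "G" and left == "C":
                    row["A"] *= cpg_boost
                rates[(left, mid, right)] = row
    return rates

def simulate_nbdm(seq, rates, steps, rng):
    """Apply `steps` mutation attempts to a circular sequence.

    Each attempt picks a random site and accepts a substitution with
    probability proportional to its context-dependent rate (rejection sampling).
    """
    seq = list(seq)
    n = len(seq)
    max_total = max(sum(row.values()) for row in rates.values())
    for _ in range(steps):
        i = rng.randrange(n)
        context = (seq[i - 1], seq[i], seq[(i + 1) % n])  # wraps around at the ends
        u = rng.random() * max_total
        acc = 0.0
        for new_base, rate in rates[context].items():
            acc += rate
            if u < acc:
                seq[i] = new_base
                break  # if u exceeds the row total, no mutation happens
    return "".join(seq)

def dinucleotide_odds_ratio(seq, dinuc):
    """Observed dinucleotide frequency divided by the product of base frequencies."""
    n = len(seq)
    base_counts = Counter(seq)
    pair_counts = Counter(seq[i] + seq[(i + 1) % n] for i in range(n))
    expected = (base_counts[dinuc[0]] / n) * (base_counts[dinuc[1]] / n)
    return (pair_counts[dinuc] / n) / expected if expected else float("nan")

if __name__ == "__main__":
    rng = random.Random(0)
    start = "".join(rng.choice(BASES) for _ in range(2000))
    evolved = simulate_nbdm(start, make_hypothetical_rates(), 200_000, rng)
    print("CG odds ratio before:", round(dinucleotide_odds_ratio(start, "CG"), 2))
    print("CG odds ratio after: ", round(dinucleotide_odds_ratio(evolved, "CG"), 2))
```

Under these assumed rates the CG dinucleotide becomes under-represented relative to base composition once the sequence approaches equilibrium, which is the kind of motif departure the abstract attributes to NBDM.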

Highlights

  • In mammals and birds, the amino-acid composition of proteins and the relative occurrence of synonymous codons in coding regions covary strongly with the GC content of the chromosomal regions in which genes are embedded

  • i) We find very similar motif preferences in non-genic and intronic DNA, which strongly indicates that motif preferences are the product of mutation pressure and genetic drift shaping coding and non-coding DNA alike; and ii) we recreate to a great extent the native occurrences and preferences, the native occurrence-preference correlations, and the native correlations between GC content and the GC-vs-AT balance, by bringing DNA sequences to equilibrium under regimes of neighbor-base-dependent mutation (NBDM) that we estimate from substitutions retrieved from alignments of non-genic, intronic, and coding mammalian DNA that were sorted and grouped by GC content

  • We presented above several lines of evidence indicating that basic features of the primary structure of coding, intronic, and non-genic vertebrate DNA at and below the trinucleotide level are phylogenetically highly conserved, if not outright convergent, and appear to be strongly shaped by regimes of neighbor-base-dependent mutation (NBDM) that must be highly conserved and that change markedly across genomic regions, generating the strongest primary-structural departures from base-composition expectations when GC content is intermediate, as well as motif over-/under-representations that increase linearly with GC


Summary

Introduction

The amino-acid composition of proteins and the relative occurrence of synonymous codons in coding regions covary strongly with the GC content of the chromosomal regions in which genes are embedded. Explanations invoking GC-content effects in particular, and base-composition effects in general, cannot account for important primary-structural features of mammalian and avian coding regions. Nussinov introduced the use of randomizations of the placement of synonymous codons (SCs) along coding regions to estimate the statistical over-/under-representation of di- and trinucleotide motifs across successive codons, a randomization that does not allow the base composition, the amino-acid content and sequence, or the SC usage to affect the estimates.
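
The randomization described above can be sketched as follows. This is a simplified illustration, not the published index: it shuffles synonymous codons among the positions that encode the same amino acid, so the protein sequence, the codon usage, and the base composition are all unchanged. It then compares the observed count of a dinucleotide spanning codon junctions (third base of one codon, first base of the next) with its mean count over the shuffles. The function names, the choice of junction dinucleotides only, and the example sequence are assumptions made for this sketch. The codon table is the standard genetic code, and the input is assumed to be an in-frame, uppercase ACGT sequence.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
_ORDER = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_ORDER, repeat=3), _AMINO)}

def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]

def shuffle_synonymous(codon_list, rng):
    """Permute codons only among positions that encode the same amino acid."""
    positions = defaultdict(list)
    for i, codon in enumerate(codon_list):
        positions[CODON_TABLE[codon]].append(i)
    shuffled = list(codon_list)
    for idxs in positions.values():
        pool = [codon_list[i] for i in idxs]
        rng.shuffle(pool)
        for i, codon in zip(idxs, pool):
            shuffled[i] = codon
    return shuffled

def junction_count(codon_list, dinuc):
    """Count occurrences of `dinuc` formed by codon position 3 and the next codon's position 1."""
    return sum(1 for a, b in zip(codon_list, codon_list[1:]) if a[2] + b[0] == dinuc)

def junction_preference(cds, dinuc, n_shuffles=200, seed=0):
    """Return (observed count, mean count under synonymous-codon shuffling)."""
    rng = random.Random(seed)
    codon_list = split_codons(cds)
    observed = junction_count(codon_list, dinuc)
    null_counts = [
        junction_count(shuffle_synonymous(codon_list, rng), dinuc)
        for _ in range(n_shuffles)
    ]
    return observed, sum(null_counts) / len(null_counts)

if __name__ == "__main__":
    # Made-up toy sequence, for illustration only.
    toy_cds = "ATGGCCGCGCTGAAACTGGCTGCCTTAAAGGCGTAA"
    obs, null_mean = junction_preference(toy_cds, "CG")
    print("observed:", obs, "shuffled mean:", round(null_mean, 2))
```

An observed count well above or below the shuffled mean would suggest over- or under-representation of the motif at codon junctions. A real analysis would need many genes and a proper significance test.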
