Abstract
Whole exome sequences have now been collected for millions of humans, with the related goals of identifying pathogenic mutations in patients and establishing reference repositories of data from unaffected individuals. As a result, we are approaching an important limit, in which datasets are large enough that, in the absence of natural selection, every highly mutable site will have experienced at least one mutation in the genealogical history of the sample. Here, we focus on CpG sites that are methylated in the germline and experience mutations to T at an elevated rate of ~10⁻⁷ per site per generation; considering synonymous mutations in a sample of 390,000 individuals, ~99% of such CpG sites harbor a C/T polymorphism. Methylated CpG sites provide a natural mutation saturation experiment for fitness effects: as we show, at current sample sizes, not seeing a non-synonymous polymorphism is indicative of strong selection against that mutation. We rely on this idea to directly identify a subset of CpG transitions that are likely to be highly deleterious, including ~27% of possible loss-of-function mutations and up to 20% of possible missense mutations, depending on the type of functional site in which they occur. Unlike methylated CpGs, most mutation types, with rates on the order of 10⁻⁸ or 10⁻⁹ per site per generation, remain very far from saturation. We discuss what these findings imply for interpreting the potential clinical relevance of mutations from their presence or absence in reference databases and for inferences about the fitness effects of new mutations.
Highlights
A central goal of human genetics is to identify pathogenic mutations and predict how likely they are to cause disease.
An attractive feature of methylated CpG sites is that a single mechanism, the spontaneous deamination of methyl-cytosine, is believed to underlie the uniquely high rate of C>T mutations at these sites [31]; germline methylation at CpG sites is strongly predictive of their mutability ([33,34,35]; Fig. 1-figure supplement 2).
We define “methylated” CpG sites in exons as those that are methylated ≥65% of the time in both testes and ovaries. For these ~1.1 million sites, we calculate a mean haploid, autosomal C>T mutation rate of 1.17 × 10⁻⁷ per generation using de novo mutations (DNMs) in a sample of ~2900 sequenced parent-offspring trios (Methods, Fig. 1-figure supplements 1-2, ref 36).
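The saturation argument can be illustrated with a short back-of-the-envelope calculation. The sketch below is not the authors' method: it assumes a simple Poisson model in which the number of mutations at a neutral site is Poisson-distributed with mean equal to the mutation rate times the total branch length of the sample genealogy, and it ignores rate variation among sites, recurrent and back mutations, and selection. The function names and the "implied" tree length are hypothetical; the tree length is back-calculated from the reported ~99% of synonymous methylated CpG sites being polymorphic, and is not a value given in the text.

```python
import math


def prob_at_least_one_mutation(mu, tree_length):
    """P(>= 1 mutation) at a site, assuming a Poisson number of mutations
    with mean mu * tree_length (tree_length in generations)."""
    return 1.0 - math.exp(-mu * tree_length)


def tree_length_for_saturation(mu, fraction):
    """Total genealogy length (generations) at which a given fraction of
    neutral sites with rate mu would carry at least one mutation."""
    return -math.log(1.0 - fraction) / mu


# Values taken from the text.
MU_METHYLATED_CPG = 1.17e-7   # C>T rate at methylated CpGs, per site per generation
SATURATED_FRACTION = 0.99     # ~99% of synonymous methylated CpGs are polymorphic

# Illustrative assumption: the tree length implied by those two numbers
# under the Poisson model, not a quantity reported in the text.
implied_length = tree_length_for_saturation(MU_METHYLATED_CPG, SATURATED_FRACTION)

# Compare methylated CpGs with mutation types at rates of order 1e-8 and 1e-9.
for mu in (MU_METHYLATED_CPG, 1e-8, 1e-9):
    p = prob_at_least_one_mutation(mu, implied_length)
    print(f"mu = {mu:.2e}: P(at least one mutation) = {p:.3f}")
```

Under these assumptions, the same genealogy that saturates methylated CpG sites would leave only about a third of neutral sites with a rate of 10⁻⁸ carrying a mutation, and about 4% of sites with a rate of 10⁻⁹, which is in line with the statement that most mutation types remain far from saturation.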
Summary
A central goal of human genetics is to identify pathogenic mutations and predict how likely they are to cause disease. To this end, exome sequencing in cases and controls is often used to help identify variants with potentially large effects on disease risk (e.g., refs 1–4). Even where this approach yields an enrichment of variants in cases, the specific subset of mutations that contributes to disease often remains unknown; in individual patients, sequencing routinely yields candidate mutations whose significance is unclear [5,6]. Comparisons of sequences across species have been widely used to identify highly conserved genomic regions maintained by selection over millions of years, presumably because of their functional importance (e.g., refs 7,9,13,14).