Abstract

Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes[1]. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.

Highlights

  • Obvious ethical and technical constraints prevent the large-scale engineering of loss-of-function mutations in humans

  • We describe the detection of pLoF variants in a cohort of 125,748 individuals with whole-exome sequence data and 15,708 individuals with whole-genome sequence data, as part of the Genome Aggregation Database, the successor to the Exome Aggregation Consortium (ExAC)

  • Some LoF variants will result in embryonic lethality in humans in a heterozygous state, whereas others are benign even at homozygosity, with a wide spectrum of effects in between. Throughout this manuscript, we define pLoF variants to be those that introduce a premature stop, shift the transcriptional reading frame, or alter the two essential splice-site nucleotides immediately to the left and right of each exon found in protein-coding transcripts, and ascertain their presence in the cohort of 125,748 individuals with exome sequence data. As these variants are enriched for annotation artefacts[1], we developed the loss-of-function transcript effect estimator (LOFTEE) package, which applies stringent filtering criteria from first principles to pLoF variants annotated by the variant effect predictor (Extended Data Fig. 5a).
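The pLoF definition above can be illustrated with a minimal sketch that flags a transcript-level annotation as a candidate pLoF from its consequence terms. It assumes the annotations use the Sequence Ontology terms `stop_gained`, `frameshift_variant`, `splice_acceptor_variant` and `splice_donor_variant`, joined by `&` when a variant has several consequences on one transcript. The function name `is_candidate_plof` and its arguments are hypothetical, and the sketch does not reproduce any of the LOFTEE filters; it covers only the initial consequence-based definition.

```python
# Sequence Ontology terms assumed to correspond to the three pLoF classes
# named in the text: premature stop, frameshift, essential splice site.
PLOF_TERMS = frozenset({
    "stop_gained",
    "frameshift_variant",
    "splice_acceptor_variant",
    "splice_donor_variant",
})


def is_candidate_plof(consequence: str, biotype: str) -> bool:
    """Return True if an annotation matches the consequence-based pLoF definition.

    consequence: one or more consequence terms joined by '&' (assumed format).
    biotype: transcript biotype; only 'protein_coding' transcripts qualify.
    """
    if biotype != "protein_coding":
        return False
    terms = set(consequence.split("&"))
    # Candidate pLoF if at least one term belongs to the pLoF classes.
    return not PLOF_TERMS.isdisjoint(terms)


if __name__ == "__main__":
    examples = [
        ("stop_gained", "protein_coding"),
        ("splice_donor_variant&intron_variant", "protein_coding"),
        ("missense_variant", "protein_coding"),
        ("frameshift_variant", "lncRNA"),
    ]
    for consequence, biotype in examples:
        print(consequence, biotype, is_candidate_plof(consequence, biotype))
```

Variants flagged this way would still be only candidates: as the text notes, they are enriched for annotation artefacts, which is why a further filtering step is applied before they are treated as high-confidence.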

Summary

We provide subsets of the gnomAD datasets, which exclude individuals who are cases in case–control studies, or who are cases of a few particular disease types such as cancer and neurological disorders, or who are aggregated in the Bravo TOPMed variant browser (https://bravo.sph.umich.edu). Among these individuals, we discovered 17.2 million and 261.9 million variants in the exome and genome datasets, respectively; these variants were filtered using a custom random forest process (Supplementary Information) to 14.9 million and 229.9 million high-quality variants. The number of putative de novo calls after filtering is in line with expectations[20] (Extended Data Fig. 3e–h), and our model had a recall of 97.3% for de novo SNVs and 98% for de novo indels based on 375 independently validated de novo variants in our whole-exome trios (295 SNVs and 80 indels) (Extended Data Fig. 3i, j). These results indicate that our filtering strategy produced a call-set with high precision and recall for both common and rare variants. These variants reflect the expected patterns based on mutation and selection: we observe 84.9% of all possible consistently methylated CpG-to-TpG transitions that would create synonymous variants in the human exome (Supplementary Table 14), which indicates that at this

The LoF intolerance of human genes
Biological properties of constraint
Discussion
Online content
Genome Aggregation Database Consortium
Code availability
Reporting Summary
Data analysis
Life sciences study design
Population characteristics
Findings
Ethics oversight