Abstract

As new proposals aim to sequence ever larger collections of humans, it is critical to have a quantitative framework to evaluate the statistical power of these projects. We developed a new algorithm, UnseenEst, and applied it to the exomes of 60,706 individuals to estimate the frequency distribution of all protein-coding variants, including rare variants that have not yet been observed in current cohorts. Our results quantify the number of new variants that we expect to identify as sequencing cohorts reach hundreds of thousands of individuals. With 500K individuals, we expect to capture 7.5% of all possible loss-of-function variants and 12% of all possible missense variants. We also estimate that 2,900 genes have a loss-of-function frequency of <0.00001 in healthy humans, consistent with very strong intolerance to gene inactivation.

Highlights

  • The MIT Faculty has made this article openly available

  • We apply UnseenEst to the largest available collection of sequenced individuals to estimate the discovery power of much larger cohorts, such as the ones proposed by the Precision Medicine Initiative

  • While our predictions here assume that the samples are representative of the U.S. demography, UnseenEst can be directly applied to estimate the discovery rate of cohorts with a different ancestral composition


Summary

Introduction

Predicting the number of new variants we expect to identify in larger cohorts requires accurate estimates of the allele frequencies of all genetic variation in the human population, including the rare variants that have not been observed in current sequencing cohorts[4,5,6]. Estimating the frequency distribution of genetic variation is closely related to the classic statistics problem of estimating the number of unseen animal species from capture experiments[13,14]. Leveraging this connection, previous methods used Bayesian and jackknife approaches to estimate the discovery rate of new variants[4,5,15]. The jackknife has been validated to produce accurate 20-fold extrapolations on small cohorts, such as the individuals from the 1000 Genomes populations[16].
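To make the link to the unseen-species problem concrete, the sketch below computes a frequency "fingerprint" (how many variants were observed exactly once, twice, and so on) and applies the classical Good-Toulmin estimator to predict how many new variants an enlarged cohort would reveal. This is neither UnseenEst nor the jackknife method cited above; it is a textbook estimator swapped in for illustration. The function names and the toy allele counts are hypothetical.

```python
from collections import Counter


def fingerprint(allele_counts):
    """Map i -> number of variants observed exactly i times.

    `allele_counts` holds one observed count per variant (hypothetical input).
    Variants with a count of zero are unseen and therefore ignored.
    """
    return Counter(c for c in allele_counts if c > 0)


def good_toulmin_new_variants(allele_counts, t):
    """Good-Toulmin estimate of the number of new variants expected when the
    cohort grows from n to (1 + t) * n individuals.

    The alternating series is only reliable for 0 <= t <= 1, i.e. at most
    doubling the cohort, so larger values of t are rejected.
    """
    if not 0 <= t <= 1:
        raise ValueError("Good-Toulmin is only reliable for 0 <= t <= 1")
    fp = fingerprint(allele_counts)
    # U(t) = sum_i (-1)^(i+1) * t^i * F_i, with F_i the fingerprint entry for i
    return sum((-1) ** (i + 1) * t ** i * f_i for i, f_i in fp.items())


# Toy data: three singletons, two doubletons, one variant seen five times.
toy_counts = [1, 1, 1, 2, 2, 5]
new_if_doubled = good_toulmin_new_variants(toy_counts, t=1)
```

The restriction to t <= 1 is the important caveat: this simple estimator cannot reach the 20-fold range mentioned for the jackknife, which is one reason extrapolation to much larger cohorts calls for more elaborate methods.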

Methods
Results
Conclusion
