The unprecedented scale of genomic databases has revolutionized our ability to identify regions in the human genome intolerant to variation—regions often implicated in disease. However, these datasets remain constrained by limited ancestral diversity. Here, we analyze whole-exome sequencing data from 460,551 UK Biobank and 125,748 Genome Aggregation Database (gnomAD) participants across multiple ancestries to test several key intolerance metrics, including the Residual Variance Intolerance Score (RVIS), Missense Tolerance Ratio (MTR), and Loss-of-Function Observed/Expected ratio (LOF O/E). We demonstrate that increasing ancestral representation, rather than sample size alone, critically drives their performance. Scores trained on variation observed in African and Admixed American ancestral groups show higher resolution in detecting haploinsufficient and neurodevelopmental disease risk genes compared to scores trained on European ancestry groups. Most strikingly, MTR trained on 43,000 multi-ancestry exomes demonstrates greater predictive power than when trained on a nearly 10-fold larger dataset of 440,000 non-Finnish European exomes. We further find that European ancestry group-based scores are likely approaching saturation. These findings highlight the need for enhanced population representation in genomic resources to fully realize the potential of precision medicine and drug discovery. Ancestry group-specific scores are publicly available through an interactive portal: http://intolerance.public.cgr.astrazeneca.com/.
Read full abstract