Abstract

Missense variants are present amongst the healthy population, but some of them are causative of human diseases. Classifying variants as associated with "healthy" or "diseased" states is therefore not always straightforward. A deeper understanding of the nature of missense variants in health and disease, the cellular processes they may affect, and the general molecular principles which underlie these differences is essential to offer mechanistic explanations of the true impact of pathogenic variants. Here, we have formalised a statistical framework which enables robust probabilistic quantification of variant enrichment across full-length proteins, their domains, and 3D structure-defined regions. Using this framework, we validate and extend previously reported trends of variant enrichment in different protein structural regions (surface/core/interface). By examining the association of variant enrichment with available functional pathways and with transcriptomic and proteomic (protein half-life, thermal stability, abundance) data, we have mined a rich set of molecular features which distinguish between pathogenic and population variants: pathogenic variants mainly affect proteins involved in cell proliferation and nucleotide processing and are enriched in more abundant proteins. Additionally, rare population variants display features closer to those of common variants than to those of pathogenic variants. We validate the association between these molecular features and variant pathogenicity by comparing against existing in silico variant impact annotations. This study provides molecular detail on how different proteins exhibit resilience and/or sensitivity towards missense variants, and provides the rationale to prioritise variant-enriched proteins and protein domains for therapeutic targeting and development. The ZoomVar database, which we created for this study, is available at fraternalilab.kcl.ac.uk/ZoomVar. It allows users to programmatically annotate missense variants with protein structural information and to calculate variant enrichment in different protein structural regions.
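
To make the idea of region-level variant enrichment concrete, the following is a minimal sketch and not the authors' implementation: the abstract does not specify the statistical model, so the density ratio and the one-sided binomial test used here are assumptions, as are the function name `region_enrichment` and the example counts.

```python
from math import comb


def region_enrichment(variants_in_region, region_length,
                      variants_in_protein, protein_length):
    """Hypothetical helper: density ratio and one-sided binomial p-value.

    Null model (an assumption): each variant lands on any residue of the
    protein with equal probability, so the chance that it falls inside the
    region is region_length / protein_length.
    """
    if not 0 < region_length <= protein_length:
        raise ValueError("region_length must be in (0, protein_length]")
    if not 0 <= variants_in_region <= variants_in_protein:
        raise ValueError("variants_in_region must be in [0, variants_in_protein]")

    p_null = region_length / protein_length
    expected = variants_in_protein * p_null

    # Ratio > 1 suggests enrichment in the region, < 1 suggests depletion.
    ratio = variants_in_region / expected if expected > 0 else float("nan")

    # P(X >= observed) for X ~ Binomial(variants_in_protein, p_null).
    n, k = variants_in_protein, variants_in_region
    p_value = sum(
        comb(n, i) * p_null**i * (1 - p_null) ** (n - i)
        for i in range(k, n + 1)
    )
    return ratio, min(1.0, p_value)


# Illustrative numbers only: a 400-residue protein with a 100-residue core,
# carrying 20 missense variants of which 10 fall in the core.
ratio, p_value = region_enrichment(10, 100, 20, 400)
print(f"density ratio = {ratio:.2f}, one-sided p = {p_value:.4f}")
```

The same calculation could be repeated for surface and interface regions, with multiple-testing correction applied across proteins; whether ZoomVar does so in this way is not stated in the abstract.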

Highlights

  • We present a multifactorial analysis of missense variants observed in the general population [30], in comparison to somatic cancer-associated missense variants from the COSMIC database [31] and disease-associated missense variants from the ClinVar database [32]

  • Our results indicate two competing trends for disease-associated variants: (i) disease-associated variants tend to localise to more abundant and stable proteins, which may suggest that these proteins are more sensitive to perturbation by variants; (ii) disease-associated variants in protein cores tend to localise to less stable proteins, which is consistent with the idea that such proteins may be destabilised to a degree at which function is deleteriously impacted. For variants occurring in the core, gnomAD common variants show negative correlations with protein stability; this could potentially support the argument presented by Mahlich and colleagues [47] that common variants could affect molecular function more than rare variants

  • Throughout this work, we show that missense variants in the general population, considered nominally healthy, show properties distinct from those in disease cohorts, from both macroscopic (“omics” features and functional pathways) and microscopic perspectives

Summary

Introduction

Our analyses highlight a striking difference in the enrichment of pathogenic and population variants, which depends upon their localisation to protein domains and structural features. gnomAD variants (both common and rare) and somatic variants falling outside of cancer-related genes display the opposite trend: they tend to localise preferentially to protein surfaces, where they are less likely to affect protein structure and function than either core or interface mutations.

Results
Conclusion