Abstract
Advances in sequencing have led to a rapid accumulation of mutations, some of which are associated with diseases. However, to draw mechanistic conclusions, a biochemical understanding of these mutations is necessary. For coding mutations, accurate prediction of significant changes in either the stability of proteins or their affinity to their binding partners is required. Traditional methods have used semi-empirical force fields, while newer methods employ machine learning of sequence and structural features. Here, we show how combining both of these approaches leads to a marked boost in accuracy. We introduce ELASPIC, a novel ensemble machine learning approach that is able to predict stability effects upon mutation in both domain cores and domain-domain interfaces. We combine semi-empirical energy terms, sequence conservation, and a wide variety of molecular details with a Stochastic Gradient Boosting of Decision Trees (SGB-DT) algorithm. The accuracy of our predictions surpasses existing methods by a considerable margin, achieving correlation coefficients of 0.77 for stability predictions and 0.75 for affinity predictions. Notably, we integrated homology modeling to enable proteome-wide prediction and show that accurate prediction on modeled structures is possible. Lastly, ELASPIC showed significant differences between various types of disease-associated mutations, as well as between disease and common neutral mutations. Unlike pure sequence-based prediction methods that try to predict phenotypic effects of mutations, our predictions unravel the molecular details governing protein instability and help us better understand the molecular causes of diseases.
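To make the SGB-DT idea and the reported correlation metric concrete, the following is a minimal sketch and not the ELASPIC implementation. It boosts depth-one regression trees (stumps) on random subsamples of a synthetic dataset and scores held-out predictions with the Pearson correlation coefficient. The two features (an energy term and a conservation score), the toy target, and all function names and hyperparameters are assumptions made for illustration only.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_stump(rows, residuals):
    """Find the one-feature threshold split that minimizes squared error."""
    best = None  # (error, feature index, threshold, left value, right value)
    for j in range(len(rows[0])):
        order = sorted(range(len(rows)), key=lambda i: rows[i][j])
        for k in range(1, len(order)):
            left = [residuals[i] for i in order[:k]]
            right = [residuals[i] for i in order[k:]]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                thr = (rows[order[k - 1]][j] + rows[order[k]][j]) / 2
                best = (err, j, thr, lv, rv)
    return best[1:]


def predict_stump(stump, row):
    j, thr, lv, rv = stump
    return lv if row[j] <= thr else rv


def fit_sgb(rows, targets, n_rounds=40, learning_rate=0.2, subsample=0.5, seed=0):
    """Stochastic gradient boosting for squared loss with stumps as base learners."""
    rng = random.Random(seed)
    base = sum(targets) / len(targets)  # initial prediction: the target mean
    preds = [base] * len(rows)
    stumps = []
    n_sub = max(2, int(subsample * len(rows)))
    for _ in range(n_rounds):
        # "Stochastic": each round fits residuals on a random subsample only.
        idx = rng.sample(range(len(rows)), n_sub)
        stump = fit_stump([rows[i] for i in idx], [targets[i] - preds[i] for i in idx])
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, r) for p, r in zip(preds, rows)]
    return base, learning_rate, stumps


def predict_sgb(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


def make_toy_data(n, seed):
    """Synthetic rows of [energy_term, conservation]; the target is made up."""
    rng = random.Random(seed)
    rows = [[rng.random(), rng.random()] for _ in range(n)]
    targets = [2.0 * e - 1.0 * c + rng.gauss(0, 0.1) for e, c in rows]
    return rows, targets


train_rows, train_y = make_toy_data(200, seed=1)
test_rows, test_y = make_toy_data(100, seed=2)
model = fit_sgb(train_rows, train_y)
test_r = pearson(test_y, [predict_sgb(model, r) for r in test_rows])
print(f"held-out Pearson r = {test_r:.2f}")
```

In practice one would likely use a library implementation with deeper trees and many more features; the sketch only shows how row subsampling and residual fitting combine, and how a correlation coefficient such as those quoted above is computed from predicted and measured values.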
Highlights
Any two human genomes differ in a number of different ways
Using high-quality experimental datasets, our results show that ELASPIC outperforms all other methods in predicting the effect of core and interface mutations
We developed ELASPIC, a method to predict stability effects induced by mutations in the core of a domain and in the interface of a complex
Summary
Any two human genomes differ in a number of different ways. There are changes at the level of individual nucleotides, called single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs) depending on frequency, as well as many larger ones, such as deletions, insertions, and copy number variations. Although a novel energy-based approach trained for the prediction of mutational effects in protein complexes has shown relatively good results, its performance was not comparable to that of the methods evaluating core mutations [36]. This result stresses the importance of training on modeled structures in order to predict effects of mutations proteome-wide, in which approx. We are already able to observe the expected trend between disease classes that is shown for core mutation effects (Figure 4B). We expect this trend to improve in the future as the increasing number and diversity of structures of protein complexes in the PDB facilitate homology modeling.