Abstract

Protein–protein interactions play a crucial role in all cellular functions and biological processes, and mutations leading to their disruption are enriched in many diseases. While a number of computational methods to assess the effects of variants on protein–protein binding affinity have been proposed, they are generally limited to the analysis of single point mutations and have been shown to perform poorly on independent test sets. Here, we present mmCSM-PPI, a scalable and effective machine learning model for accurately assessing changes in protein–protein binding affinity caused by single and multiple missense mutations. We expanded our well-established graph-based signatures to capture physicochemical and geometrical properties of multiple wild-type residue environments and integrated them with substitution scores and dynamics terms from normal mode analysis. mmCSM-PPI achieved a Pearson's correlation of up to 0.75 (RMSE = 1.64 kcal/mol) under 10-fold cross-validation and 0.70 (RMSE = 2.06 kcal/mol) on a non-redundant blind test, outperforming existing methods. Our method is freely available as a user-friendly web server and API at http://biosig.unimelb.edu.au/mmcsm_ppi.

Highlights

  • Protein–protein interactions (PPIs) are a vital mechanism for regulation and coordination of most biological processes within the cell [1,2]

  • We evaluated the performance of mmCSM-PPI across five different types of cross-validation on our training set

  • The performance of mmCSM-PPI was compared to Discovery Studio and FoldX (Supplementary Table S11); our approach significantly outperformed both on all evaluation metrics


Summary

MATERIALS AND METHODS

The data used in this work were derived from SKEMPI2 [12], a manually curated database of experimental thermodynamic and kinetic parameters for wild-type and mutant protein–protein complexes that have been mapped to protein structures available in the Protein Data Bank [54]. Since the entries in our dataset were not uniformly distributed across all protein–protein complexes (Supplementary Table S8), we evaluated the performance of our approach by randomly sampling up to 10 mutations per protein complex, repeated 10 times (generating 10 subsets), followed by randomly selecting 80% of entries for training and the remaining 20% for testing, repeated 10 times (CV3). For this type of cross-validation, our predictive model achieved Pearson’s, Kendall’s and Spearman’s correlations of 0.83, 0.63 and 0.81, again with small deviations over the repetitions (σ = 0.03) (Figure 2A), and an average RMSE of 1.85 kcal/mol (σ = 0.40). Four hundred and ninety multiple point mutations were randomly selected across 81 differ-
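The CV3 sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the entry layout (dictionaries with hypothetical `complex` and `ddg` keys), the function names, the `fit` callback, and the reading that each of the 10 subsets is split 10 times are all assumptions introduced here. Only Pearson's correlation and RMSE are computed, using the standard library.

```python
import math
import random
from collections import defaultdict


def cap_per_complex(entries, max_per_complex=10, rng=None):
    """Randomly keep at most `max_per_complex` entries for each complex."""
    rng = rng or random.Random()
    by_complex = defaultdict(list)
    for entry in entries:
        by_complex[entry["complex"]].append(entry)  # "complex" key is hypothetical
    subset = []
    for group in by_complex.values():
        subset.extend(rng.sample(group, min(max_per_complex, len(group))))
    return subset


def split_80_20(subset, rng, train_fraction=0.8):
    """Shuffle a copy of the subset and cut it into train and test parts."""
    shuffled = list(subset)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def pearson(x, y):
    """Pearson's correlation; returns NaN when either input has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def cv3(entries, fit, n_subsets=10, n_splits=10, seed=0):
    """Assumed reading of CV3: 10 capped subsets, each split 80/20 ten times.

    `fit` is a hypothetical callback that takes training entries and
    returns a function mapping one entry to a predicted value.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_subsets):
        subset = cap_per_complex(entries, 10, rng)
        for _ in range(n_splits):
            train, test = split_80_20(subset, rng)
            predict = fit(train)
            y_true = [e["ddg"] for e in test]  # "ddg" key is hypothetical
            y_pred = [predict(e) for e in test]
            scores.append((pearson(y_true, y_pred), rmse(y_true, y_pred)))
    return scores
```

Averaging the returned pairs and taking their standard deviation would give summary figures of the same kind as those reported above. Capping each complex before splitting keeps heavily studied complexes from dominating the score.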

