Abstract

Reliable identification of near‐native poses of docked protein–protein complexes is still an unsolved problem. The intrinsic heterogeneity of protein–protein interactions is challenging for traditional biophysical or knowledge‐based potentials, and the identification of many false‐positive binding sites is not unusual. Often, ranking protocols are based on initial clustering of docked poses followed by the application of an energy function to rank each cluster according to its lowest‐energy member. Here, we present an approach to cluster ranking that is based not on a single molecular descriptor (e.g., an energy function) but on a large number of descriptors integrated in a machine learning model: an extremely randomized tree classifier trained on 109 molecular descriptors. The protocol first locally enriches clusters with additional poses. The clusters are then characterized using features describing the distribution of molecular descriptors within each cluster, and these features are combined into a pairwise cluster comparison model to discriminate near‐native from incorrect clusters. The results show that our approach is able to identify clusters containing near‐native protein–protein complexes. In addition, we present an analysis of the descriptors with respect to their power to discriminate near‐native from incorrect clusters and show how data transformations and recursive feature elimination can improve the ranking performance. Proteins 2017; 85:528–543. © 2016 Wiley Periodicals, Inc.
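
The cluster characterization and pairwise comparison steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The summary statistics (minimum, mean, standard deviation), the difference-vector encoding of a cluster pair, and the win-count aggregation are all assumptions made for illustration. The trained extremely randomized tree classifier is replaced by a hypothetical stand-in function, `prefers_first`, so that the example runs with the Python standard library alone.

```python
from itertools import combinations
from statistics import mean, pstdev


def cluster_features(poses):
    """Summarize a cluster by the distribution of each descriptor.

    `poses` is a list of descriptor vectors, one per docked pose.
    Returns [min, mean, std] for every descriptor, concatenated.
    The choice of these three statistics is an assumption.
    """
    feats = []
    for values in zip(*poses):  # iterate over descriptors (columns)
        feats.extend([min(values), mean(values), pstdev(values)])
    return feats


def pair_features(feats_a, feats_b):
    """Encode a cluster pair as the element-wise feature difference (assumed)."""
    return [a - b for a, b in zip(feats_a, feats_b)]


def rank_clusters(clusters, prefers_first):
    """Rank clusters by the number of pairwise comparisons they win.

    `clusters` maps a cluster name to its list of pose descriptor vectors.
    `prefers_first` is a hypothetical stand-in for a trained pairwise
    classifier: it takes a pair feature vector and returns True when the
    first cluster of the pair is judged more likely to be near-native.
    """
    feats = {name: cluster_features(poses) for name, poses in clusters.items()}
    wins = {name: 0 for name in clusters}
    for a, b in combinations(clusters, 2):
        winner = a if prefers_first(pair_features(feats[a], feats[b])) else b
        wins[winner] += 1
    # sorted() is stable, so ties keep their original insertion order
    return sorted(wins, key=wins.get, reverse=True)


def lower_mean_energy(pair):
    """Toy comparator: prefer the cluster whose first descriptor has the
    lower mean (index 1 of the pair vector is that mean difference)."""
    return pair[1] < 0


if __name__ == "__main__":
    toy_clusters = {
        "c1": [[-5.0, 0.2], [-4.0, 0.3]],
        "c2": [[-9.0, 0.1], [-8.0, 0.2]],
        "c3": [[-1.0, 0.5], [-2.0, 0.4]],
    }
    print(rank_clusters(toy_clusters, lower_mean_energy))
```

In practice, the stand-in comparator would be replaced by the prediction of a classifier trained on labeled cluster pairs. Framing the task as pairwise comparison lets the model learn relative preferences between clusters of the same complex rather than an absolute score.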

Highlights

  • Specific protein–protein interactions are key to most cellular functions, ranging from effective signal transduction of environmental conditions to the nucleus to modulation of cell-cell interactions and efficient regulation of metabolic processes.[1,2,3]

  • We show that a reduced set of features based on these 109 molecular descriptors is beneficial for ranking performance and that dimensionality reduction and feature space transformations with methods such as principal component analysis (PCA) or factor analysis (FA) can improve the top 1 and top 5 ranking

  • The aim of this work is to establish a machine learning protocol that is able to rank near-native clusters of docked protein–protein complexes over incorrect ones by using a wide set of currently available molecular descriptors important for protein–protein interactions.


Summary

Introduction

Specific protein–protein interactions are key to most cellular functions, ranging from effective signal transduction of environmental conditions to the nucleus to modulation of cell-cell interactions and efficient regulation of metabolic processes.[1,2,3] Computational docking of such complexes faces two main challenges. The first is to develop methods to efficiently sample the conformational space of the interacting proteins,[5,6] perhaps aided by experimental data.[7] The second is to effectively rank docked poses, typically thousands generated by current docking algorithms,[8] in order to identify docking ensembles (clusters) or single docked poses that resemble native-like binding. Computationally expensive refinement or relaxation methods based on conformational sampling from MD simulations[12] benefit from a reduced solution space. These methods are often required to correctly model conformational transitions from unbound to bound.[13] Even though a number of individual potentials have been developed for the identification of protein–protein interactions, all of them suffer from false-positive identifications of binding modes, whereby incorrect solutions are ranked highly.
