Prediction of protein-protein interaction sites using an ensemble method

Qiwen Dong, Jihong Guan, Shuigeng Zhou, Lei Deng

doi:10.1186/1471-2105-10-426

Abstract

Background: Prediction of protein-protein interaction sites is one of the most challenging and intriguing problems in the field of computational biology. Although much progress has been achieved by using various machine learning methods and a variety of available features, the problem is still far from being solved.

Results: In this paper, an ensemble method is proposed that combines a bootstrap resampling technique, SVM-based fusion classifiers and a weighted voting strategy to overcome the class imbalance problem and to make effective use of a wide variety of features. We evaluate the ensemble classifier on a dataset extracted from 99 polypeptide chains with 10-fold cross-validation and obtain an AUC score of 0.86, with a sensitivity of 0.76 and a specificity of 0.78, which are better than those of existing methods. To improve the usefulness of the proposed method, two special ensemble classifiers are designed to handle the cases of missing homologues and missing structural information, respectively, and the performance is still encouraging. The robustness of the ensemble method is also evaluated by classifying interaction sites from surface residues as well as from all residues in proteins. Moreover, we demonstrate the applicability of the proposed method by identifying interaction sites in the non-structural proteins (NS) of the influenza A virus, which may be utilized as potential drug target sites.

Conclusion: Our experimental results show that the ensemble classifiers are quite effective in predicting protein interaction sites. The Sub-EnClassifiers with the resampling technique can alleviate the class imbalance problem, and combining Sub-EnClassifiers with a wide variety of feature groups can significantly improve prediction performance.

Highlights

Prediction of protein-protein interaction sites is one of the most challenging and intriguing problems in the field of computational biology
In this study, inspired by the methods used by Zhao and Chen, we propose a hybrid approach that incorporates a bootstrap resampling technique, Support Vector Machine (SVM)-based fusion classifiers and a weighted voting strategy to overcome the class imbalance problem and improve the performance of protein interaction site prediction
The combination of position-specific scoring matrices (PSSMs), evolutionary conservation score and sequence entropy outperforms the combination of PSSM and evolutionary conservation score alone, with a 2% improvement in AUC score, which suggests that sequence entropy helps improve performance
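
The hybrid approach described above (bootstrap resampling, fused base classifiers, weighted voting) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: a nearest-centroid rule stands in for the SVM base learner so that the example needs only the standard library, each round draws equally sized bootstrap samples of interface (1) and non-interface (0) residues, and each base learner's vote is weighted by its balanced accuracy on the full training set. The names `train_ensemble`, `predict_score`, `predict` and `make_toy_data` are hypothetical.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class NearestCentroid:
    """Stand-in base learner (NOT an SVM; swapped in to avoid dependencies)."""
    def fit(self, X, y):
        self.pos = centroid([x for x, t in zip(X, y) if t == 1])
        self.neg = centroid([x for x, t in zip(X, y) if t == 0])
        return self

    def predict(self, x):
        return 1 if sq_dist(x, self.pos) < sq_dist(x, self.neg) else 0

def balanced_accuracy(model, X, y):
    """Mean of sensitivity and specificity; assumes both classes are present."""
    tp = sum(1 for x, t in zip(X, y) if t == 1 and model.predict(x) == 1)
    tn = sum(1 for x, t in zip(X, y) if t == 0 and model.predict(x) == 0)
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

def train_ensemble(X, y, n_models=11, seed=0):
    """Train n_models base learners, each on a balanced bootstrap sample."""
    rng = random.Random(seed)
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    k = len(pos)  # assumption: positives (interface residues) are the minority
    ensemble = []
    for _ in range(n_models):
        boot_pos = [rng.choice(pos) for _ in range(k)]  # sample with replacement
        boot_neg = [rng.choice(neg) for _ in range(k)]  # same size: balanced set
        model = NearestCentroid().fit(boot_pos + boot_neg, [1] * k + [0] * k)
        weight = balanced_accuracy(model, X, y)  # assumed weighting scheme
        ensemble.append((weight, model))
    return ensemble

def predict_score(ensemble, x):
    """Weighted fraction of base learners voting 'interface' (in [0, 1])."""
    total = sum(w for w, _ in ensemble)
    return sum(w * m.predict(x) for w, m in ensemble) / total

def predict(ensemble, x, threshold=0.5):
    return 1 if predict_score(ensemble, x) >= threshold else 0

def make_toy_data(seed=1, n_pos=20, n_neg=200):
    """Synthetic, imbalanced 2-D data; unrelated to the paper's dataset."""
    rng = random.Random(seed)
    X = [[rng.gauss(2, 0.5), rng.gauss(2, 0.5)] for _ in range(n_pos)]
    X += [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(n_neg)]
    y = [1] * n_pos + [0] * n_neg
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    ensemble = train_ensemble(X, y)
    print(predict(ensemble, [2.0, 2.0]), predict(ensemble, [0.0, 0.0]))
```

Because every base learner sees a balanced sample, no single learner is dominated by the majority class, and the weighted vote gives a continuous score that can be thresholded to trade sensitivity against specificity.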

Summary

Introduction

Prediction of protein-protein interaction sites is one of the most challenging and intriguing problems in the field of computational biology. Solving the puzzle of predicting interaction sites is of great significance to molecular recognition. Interfaces contain a significant number of polar residues [8,9], usually where the interactions are less permanent [10]. The protein core and the non-interface surface have been found to differ significantly in sequence entropy and secondary structure [11]. Secondary structure composition appears to have little discriminatory power, because neither α-helices nor β-sheets dominate at transient binding sites [12]. Evolutionary profiles and conservation scores have been used to locate binding sites [13,14,15] with some success, since the interface core tends to be more conserved than the periphery in both obligate and nonobligate cases [16].
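
Sequence entropy, mentioned above as a discriminating feature, is commonly computed as the Shannon entropy of the residue distribution in each column of a multiple sequence alignment. The following is a minimal sketch under that assumption; the source does not specify the exact definition used, and the gap handling (gaps are ignored) and the function names `column_entropy` and `alignment_entropies` are hypothetical choices made for illustration.

```python
import math
from collections import Counter

def column_entropy(column, gap_chars="-."):
    """Shannon entropy (in bits) of the residues in one alignment column.

    Gap characters are ignored (an assumption); an all-gap column scores 0.0.
    """
    residues = [c for c in column.upper() if c not in gap_chars]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return sum(-(k / n) * math.log2(k / n) for k in counts.values())

def alignment_entropies(aligned_seqs):
    """Per-position entropy for a list of equal-length aligned sequences."""
    return [column_entropy("".join(col)) for col in zip(*aligned_seqs)]

if __name__ == "__main__":
    # Toy alignment, not taken from the paper.
    msa = ["ACD", "ACE", "AC-", "AGE"]
    for position, h in enumerate(alignment_entropies(msa), start=1):
        print(position, round(h, 3))
```

A fully conserved column has entropy 0, and the value grows as the residues at a position become more varied, so low entropy marks conserved positions.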

Methods

Results

Conclusion