Abstract

Identification of protein–protein interactions (PPIs) is a difficult and important problem in biology. Since experimental methods for identifying PPIs are both expensive and time-consuming, many computational methods have been developed to predict PPIs and interaction networks, which can be used to complement experimental approaches. However, these methods still have limitations: they require a large number of homologous proteins or a large body of literature. In this paper, we propose a novel matrix-based protein sequence representation approach to predict PPIs, using an ensemble learning method for classification. We construct the Amino Acid Contact (AAC) matrix from a statistical analysis of residue-pairing frequencies in a database of 6323 protein–protein complexes. We first represent the protein sequence as a Substitution Matrix Representation (SMR) matrix. Then, the feature vector is extracted by applying the Histogram of Oriented Gradient (HOG) and Singular Value Decomposition (SVD) algorithms to the SMR matrix. Finally, we feed the feature vector into a Random Forest (RF) to distinguish interacting pairs from non-interacting pairs. Our method is applied to several PPI datasets to evaluate its performance. On the dataset, our method achieves accuracy and sensitivity. Compared with existing methods, the accuracy of our method is increased by percentage points. On the dataset, our method achieves accuracy and sensitivity, and the accuracy of our method is increased by . On the PPI dataset, our method achieves accuracy and sensitivity, and the accuracy of our method is increased by . In addition, we test our method on an important PPI network, and it achieves accuracy. In the Wnt-related network, the accuracy of our method is increased by . The source code and all datasets are available at https://figshare.com/s/580c11dce13e63cb9a53.
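The first two steps of the pipeline described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that do not come from the paper: the scoring matrix is a toy identity matrix standing in for the AAC matrix (whose values are not given here), the SMR matrix is assumed to follow the common definition in which row i is the scoring-matrix row of the i-th residue, and the gradient histogram is a simplified single-cell version of HOG without block normalization. The SVD features and the Random Forest classifier are omitted, and all function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_substitution_matrix():
    """Hypothetical 20x20 scoring matrix (identity): a stand-in for the
    AAC matrix, whose actual values are not reproduced here."""
    return [[1.0 if a == b else 0.0 for b in AMINO_ACIDS] for a in AMINO_ACIDS]


def smr_matrix(sequence, sub_matrix):
    """Assumed SMR definition: row i is the scoring-matrix row of residue i,
    giving an N x 20 matrix. Non-standard residues raise KeyError."""
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    return [list(sub_matrix[index[aa]]) for aa in sequence]


def orientation_histogram(matrix, n_bins=9):
    """Simplified HOG-style descriptor: one global histogram of unsigned
    gradient orientations, weighted by gradient magnitude and normalized
    to sum to 1 (no cells, no block normalization)."""
    hist = [0.0] * n_bins
    if not matrix:
        return hist
    rows, cols = len(matrix), len(matrix[0])
    for r in range(rows):
        for c in range(cols):
            # Central differences, with indices clamped at the borders.
            gx = matrix[r][min(c + 1, cols - 1)] - matrix[r][max(c - 1, 0)]
            gy = matrix[min(r + 1, rows - 1)][c] - matrix[max(r - 1, 0)][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(gy, gx) % math.pi  # fold into [0, pi)
            bin_index = min(int(angle / math.pi * n_bins), n_bins - 1)
            hist[bin_index] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


def pair_features(seq_a, seq_b, sub_matrix, n_bins=9):
    """Concatenate the descriptors of both proteins into one feature vector."""
    return (orientation_histogram(smr_matrix(seq_a, sub_matrix), n_bins)
            + orientation_histogram(smr_matrix(seq_b, sub_matrix), n_bins))


features = pair_features("ACDKL", "MKV", toy_substitution_matrix())
```

Because the histogram has a fixed number of bins, proteins of different lengths map to feature vectors of the same size, which is what allows a pair of sequences to be passed to a standard classifier.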

Highlights

  • Protein–protein interactions (PPIs) are of fundamental importance for discovering the molecular mechanisms of biological systems

  • We independently analyze the performance of two protein representations: the Histogram of Oriented Gradient (HOG) and Singular Value Decomposition (SVD)

  • Our proposed method achieves high performance on the S. cerevisiae, H. pylori and Human datasets, and we further evaluate the prediction performance of our model on five independent testing datasets

Introduction

Protein–protein interactions (PPIs) are of fundamental importance for discovering the molecular mechanisms of biological systems. In recent years, many prediction methods have been developed for the large-scale analysis of PPIs. These methods draw on three categories of information: co-evolution information, natural language processing, and protein sequence features. Many methods analyze the co-evolution trend of protein–protein interactions [1,2,3,4,5,6,7,8]. They extract the evolutionary information of homologous proteins via multiple sequence alignment. Natural language processing methods automatically extract relevant pieces of information from the literature according to a certain semantic model, since a large number of known PPIs are recorded in the biological and medical scientific literature.

