Multi-level machine learning prediction of protein-protein interactions in Saccharomyces cerevisiae.

Julian Zubek, Adam Boniecki, Subhadip Basu, Maciej Mnich, Dariusz Plewczynski, Marcin Tatjewski

doi:10.7717/peerj.1041

Abstract

Accurate identification of protein–protein interactions (PPI) is the key step in understanding proteins’ biological functions, which are typically context-dependent. Many existing PPI predictors rely on aggregated features from protein sequences; however, only a few methods exploit local information about specific residue contacts. In this work we present a two-stage machine learning approach for prediction of protein–protein interactions. We start with carefully filtered data on protein complexes available for Saccharomyces cerevisiae in the Protein Data Bank (PDB) database. First, we build linear descriptions of interacting and non-interacting sequence segment pairs based on their inter-residue distances. Second, we train machine learning classifiers to predict binary segment interactions for any two short sequence fragments. The final prediction of the protein–protein interaction is made using a 2D matrix representation of all-against-all possible interacting sequence segments of both analysed proteins. The level-I predictor achieves 0.88 AUC for micro-scale, i.e., residue-level, prediction. The level-II predictor improves the results further by using a more complex learning paradigm. We perform a 30-fold macro-scale, i.e., protein-level, cross-validation experiment. The level-II predictor using PSIPRED-predicted secondary structure reaches 0.70 precision, 0.68 recall, and 0.70 AUC, whereas other popular methods score below 0.6 on all three measures (recall, precision, AUC). Our results demonstrate that a multi-scale procedure for aggregating sequence features is able to improve the machine learning results by more than 10% as compared to other sequence representations. Prepared datasets and source code for our experimental pipeline are freely available for download from: http://zubekj.github.io/mlppi/ (open source Python implementation, OS independent).
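The all-against-all matrix mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the segment length of 3, and in particular `toy_segment_score` (a hypothetical placeholder that stands in for the trained level-I classifier) are assumptions made only for illustration.

```python
from typing import Callable, List

# Assumption: a crude hydrophobic alphabet, used only by the toy scorer below.
HYDROPHOBIC = set("AVILMFWC")


def segments(sequence: str, length: int) -> List[str]:
    """Return all overlapping windows of `length` residues (step of one)."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def toy_segment_score(seg_a: str, seg_b: str) -> float:
    """Hypothetical stand-in for a level-I classifier score in [0, 1].

    Returns the fraction of aligned positions at which both residues are
    hydrophobic. A real predictor would be a trained classifier.
    """
    hits = sum(1 for a, b in zip(seg_a, seg_b)
               if a in HYDROPHOBIC and b in HYDROPHOBIC)
    return hits / len(seg_a)


def score_matrix(protein_a: str, protein_b: str, length: int,
                 scorer: Callable[[str, str], float] = toy_segment_score
                 ) -> List[List[float]]:
    """Score every segment of protein A against every segment of protein B."""
    segs_a = segments(protein_a, length)
    segs_b = segments(protein_b, length)
    return [[scorer(a, b) for b in segs_b] for a in segs_a]


# Rows index segments of the first protein, columns those of the second.
matrix = score_matrix("AVILKK", "LLLKKKK", length=3)
```

The matrix dimensions depend on the two sequence lengths, which is why a later aggregation step is needed before a fixed-input classifier can consume it.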

Highlights

Systems biology and bioinformatics study interactions between various biocomponents of living cells that span multiple spatial and temporal scales
In our experiments we followed the rule that during classifier training we can use all the information available in Protein Data Bank (PDB) complexes, but in the evaluation phase only information derived from the sequence is allowed
This was to demonstrate that our method can be employed successfully in situations where only protein sequences are known

Summary

Introduction

Systems biology and bioinformatics study interactions between various biocomponents of living cells that span multiple spatial and temporal scales. Proteins are characterised at multiple scales: first, at the microscale, by their local post-translational modifications; second, by their interactions with metabolites and small chemical molecules (inhibitors); third, at the mesoscale, by the three-dimensional structure of active sites or interaction interfaces; fourth, at the macroscale, by the global 3D structure that comprises the macromolecular complexes; and, at the time scale, by their dynamical properties related to changes in their structure or physico-chemical properties upon participating in a given biophysical process.

We transform the scores matrix into a fixed-length input vector suitable for further statistical data analysis (aggregated values over columns, diagonals, etc.), and we identify network properties (e.g., sizes of connected components) using the interaction graph. This data is used by the level-II predictor, which integrates information to a human expert.
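The transformation of a variable-size scores matrix into a fixed-length vector can be sketched as follows. This is an illustrative sketch under stated assumptions, not the published pipeline: the choice of aggregates, the 0.5 threshold, and the construction of the interaction graph (a bipartite graph linking a row segment to a column segment whenever their score reaches the threshold) are all assumptions, as are the function names.

```python
from typing import Dict, List

Matrix = List[List[float]]


def diagonal_means(matrix: Matrix) -> List[float]:
    """Mean score along each diagonal, keyed internally by offset (col - row)."""
    by_offset: Dict[int, List[float]] = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            by_offset.setdefault(j - i, []).append(value)
    return [sum(v) / len(v) for _, v in sorted(by_offset.items())]


def component_sizes(matrix: Matrix, threshold: float) -> List[int]:
    """Sizes of connected components in the assumed bipartite graph.

    Nodes are row segments and column segments; an edge joins row i and
    column j when matrix[i][j] >= threshold. Isolated nodes are ignored.
    """
    rows, cols = len(matrix), len(matrix[0])
    parent = list(range(rows + cols))  # union-find forest

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    linked = set()
    for i in range(rows):
        for j in range(cols):
            if matrix[i][j] >= threshold:
                parent[find(i)] = find(rows + j)
                linked.update((i, rows + j))

    sizes: Dict[int, int] = {}
    for node in linked:
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


def feature_vector(matrix: Matrix, threshold: float = 0.5) -> List[float]:
    """Fixed-length summary of a scores matrix of any shape."""
    flat = [v for row in matrix for v in row]
    row_means = [sum(r) / len(r) for r in matrix]
    col_means = [sum(c) / len(c) for c in zip(*matrix)]
    sizes = component_sizes(matrix, threshold)
    return [
        sum(flat) / len(flat),             # overall mean score
        max(flat),                         # strongest segment pair
        max(row_means),                    # best row aggregate
        max(col_means),                    # best column aggregate
        max(diagonal_means(matrix)),       # best diagonal aggregate
        float(len(sizes)),                 # number of connected components
        float(sizes[0]) if sizes else 0.0  # size of the largest component
    ]


example = [[0.9, 0.1, 0.0],
           [0.2, 0.8, 0.0]]
features = feature_vector(example)
```

Because every entry of the vector is an aggregate, matrices built from proteins of different lengths map to vectors of the same length, which is what a second-stage classifier requires.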

Methods

Results

Conclusion