Abstract

Protein-protein interaction (PPI) prediction is generally treated as a binary classification problem, in which negative data sampling remains an open problem. The commonly used random sampling tends to yield unrepresentative negative data containing a considerable number of false negatives. Meanwhile, most existing computational methods seldom impose rational constraints on model selection to reduce the risk of false positive predictions. In this work, we propose a novel negative data sampling method based on one-class support vector machine (SVM) to predict proteome-wide protein interactions between HTLV retrovirus and Homo sapiens: one-class SVM is used to choose reliable and representative negative data, and two-class SVM is used to yield proteome-wide outcomes as predictive feedback for rational model selection. Computational results suggest that one-class SVM is better suited as a negative data sampling method than a two-class PPI predictor, and that model selection constrained by predictive feedback helps to yield a rational predictive model that reduces the risk of false positive predictions. Some predictions have been validated by recent literature. Lastly, gene ontology based clustering of the predicted PPI networks is conducted to provide valuable cues about the pathogenesis of HTLV retrovirus.

Highlights

  • A novel one-class SVM based negative data sampling method for reconstructing proteome-wide human T-cell lymphotropic virus (HTLV)-human protein interaction networks

  • One-class SVM is used to choose reliable and representative negative data, and two-class SVM is used to yield proteome-wide outcomes as predictive feedback for rational model selection

  • Computational results suggest that one-class SVM is better suited as a negative data sampling method than a two-class protein-protein interaction (PPI) predictor, and that model selection constrained by predictive feedback helps to yield a rational predictive model that reduces the risk of false positive predictions


Summary

Introduction

This work presents a novel one-class SVM based negative data sampling method for reconstructing proteome-wide HTLV-human protein interaction networks. Computational results suggest that one-class SVM is better suited as a negative data sampling method than a two-class PPI predictor, and that model selection constrained by predictive feedback helps to yield a rational predictive model that reduces the risk of false positive predictions. To avoid negative data sampling altogether, one-class learning and clustering methods have been proposed for PPI prediction, e.g. association rule mining, one-class SVM [26,27], ensemble non-negative matrix factorization based clustering, etc. These methods, though much simpler, are more likely to yield a large fraction of false positive predictions, because they do not learn the negative (non-interaction) patterns. To assess the quality of model selection, one simple and natural approach is to use the model to predict all possible (proteome-wide) protein pairs, or a large percentage of them, and check the false positives. For large-scale intra-species PPI prediction, the computational cost of such model selection is daunting, but it is acceptable for pathogen-host PPI prediction.
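The two ideas described above, one-class SVM based negative sampling and model selection constrained by proteome-wide predictive feedback, can be illustrated with a minimal sketch. It uses scikit-learn and synthetic feature vectors; the data, the helper names `sample_negatives` and `select_model`, the "lowest one-class score" selection rule, the parameter grid, and the `max_pos_rate` cap are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-ins for protein-pair feature vectors (hypothetical data).
positives = rng.normal(loc=2.0, scale=0.5, size=(60, 4))   # known interacting pairs
unlabeled = rng.normal(loc=0.0, scale=1.5, size=(600, 4))  # all other candidate pairs


def sample_negatives(positives, unlabeled, n_neg, nu=0.1):
    """Return the n_neg unlabeled pairs that look least like the positives.

    Assumption: "reliable" negatives are taken to be the pairs with the lowest
    decision score under a one-class SVM fitted on the positives only.
    """
    occ = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(positives)
    scores = occ.decision_function(unlabeled)  # lower = less similar to positives
    return unlabeled[np.argsort(scores)[:n_neg]]


def select_model(positives, negatives, all_pairs, grid, max_pos_rate):
    """Pick (C, gamma) by cross-validation accuracy, subject to a feedback cap.

    A candidate is rejected if the two-class SVM labels more than max_pos_rate
    of all candidate pairs as interacting, which is used here as a rough proxy
    for an implausibly high false positive rate. Returns None if every
    candidate is rejected.
    """
    X = np.vstack([positives, negatives])
    y = np.r_[np.ones(len(positives), dtype=int), np.zeros(len(negatives), dtype=int)]
    best = None
    for C, gamma in grid:
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        cv_acc = cross_val_score(clf, X, y, cv=5).mean()
        pos_rate = float(clf.fit(X, y).predict(all_pairs).mean())
        if pos_rate <= max_pos_rate and (best is None or cv_acc > best["cv_acc"]):
            best = {"C": C, "gamma": gamma, "cv_acc": cv_acc, "pos_rate": pos_rate}
    return best


negatives = sample_negatives(positives, unlabeled, n_neg=len(positives))
grid = [(1.0, "scale"), (10.0, "scale"), (1.0, 0.1), (10.0, 0.1)]
best = select_model(positives, negatives, unlabeled, grid, max_pos_rate=0.2)
print("selected model:", best)
```

Cross-validation accuracy alone cannot reveal a model that over-predicts interactions, because the training negatives are, by construction, easy to separate from the positives. Checking the fraction of all candidate pairs predicted positive is one simple way to bring the proteome-wide outcome into the selection step.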
