Abstract

Background: Supervised learning and many stochastic methods for predicting protein-protein interactions require both negative and positive interactions in the training data set. Unlike positive interactions, negative interactions cannot be readily obtained from interaction data, so these must be generated. In protein-protein interactions and other molecular interactions as well, taking all non-positive interactions as negative interactions produces too many negative interactions for the positive interactions. Random selection from non-positive interactions is unsuitable, since the selected data may not reflect the original distribution of data.

Results: We developed a bootstrapping algorithm for generating a negative data set of arbitrary size from protein-protein interaction data. We also developed an efficient boosting algorithm for finding interacting motif pairs in human and virus proteins. The boosting algorithm showed the best performance (84.4% sensitivity and 75.9% specificity) with balanced positive and negative data sets. The boosting algorithm was also used to find potential motif pairs in complexes of human and virus proteins, for which structural data was not used to train the algorithm. Interacting motif pairs common to multiple folds of structural data for the complexes were proven to be statistically significant. The data set for interactions between human and virus proteins was extracted from BOND and is available at . The complexes of human and virus proteins were extracted from PDB and their identifiers are available at .

Conclusion: When the positive and negative training data sets are unbalanced, the result via the prediction model tends to be biased. Bootstrapping is effective for generating a negative data set, for which the size and distribution are easily controlled. Our boosting algorithm, trained with the balanced data sets generated via the bootstrapping method, could efficiently predict interacting motif pairs from protein interaction and sequence data.

Highlights

  • Supervised learning and many stochastic methods for predicting protein-protein interactions require both negative and positive interactions in the training data set

  • Our boosting algorithm, trained with the balanced data sets generated via the bootstrapping method, could efficiently predict interacting motif pairs from protein interaction and sequence data

  • We developed a bootstrapping algorithm for generating a negative data set of arbitrary size from protein-protein interaction data


Summary

Introduction

Supervised learning and stochastic methods are often used to predict protein-protein interactions and the linear motifs involved in them, and training these methods requires both negative and positive interactions. Negative samples, however, cannot be readily obtained from protein-protein interaction data. Assuming a negative interaction wherever there is no explicit evidence of a positive interaction yields a negative data set much larger than the positive one. Generating a negative data set via random selection is also problematic: the selection often does not reflect the original distribution of the data, so it does not produce a good prediction model.
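To make the sampling problem concrete, the following is a minimal Python sketch of one way to draw a balanced negative set by resampling. It is not the authors' bootstrapping algorithm, whose details are not given in this summary. The function name `bootstrap_negatives`, the `max_tries` guard, and the toy protein identifiers are all hypothetical. The sketch assumes that "reflecting the original distribution" means drawing each protein with the frequency it has in the positive set, and that any resampled pair absent from the positive set may be treated as a negative.

```python
import random


def bootstrap_negatives(positives, n, seed=0, max_tries=100_000):
    """Draw n putative negative (human, virus) pairs by resampling positives.

    Illustrative only: proteins are resampled with replacement from the
    positive pairs, so frequently observed proteins are drawn more often.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    # Keep duplicates so each protein is drawn in proportion to its frequency.
    humans = [h for h, _ in positives]
    viruses = [v for _, v in positives]

    negatives = []
    tries = 0
    while len(negatives) < n:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not find enough non-positive pairs")
        pair = (rng.choice(humans), rng.choice(viruses))
        if pair not in positive_set:
            # Repeats are allowed: the draw is with replacement.
            negatives.append(pair)
    return negatives


# Toy data (hypothetical identifiers): four known positive interactions.
positives = [("H1", "V1"), ("H1", "V2"), ("H2", "V1"), ("H3", "V3")]

# Request as many negatives as positives to obtain a balanced training set.
negatives = bootstrap_negatives(positives, n=len(positives), seed=42)
```

Because `n` is a free parameter, the negative set can be made any size, including exactly the size of the positive set. Drawing from the duplicated protein lists instead of from the set of all non-positive pairs is what ties the sample to the observed protein frequencies.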
