Abstract

A front-end method based on random forest proximity distance (PD) is used to screen the test set in order to improve protein-protein interaction site (PPIS) prediction. Distance metrics are assessed under the assumption that a higher-quality distance definition leads to better classification performance. On an independent test set, numerical analysis based on statistical inference shows that PD has an advantage over the Mahalanobis and cosine distances. Because the proximity distance depends on the tree composition of the random forest model, an iterative method is designed to optimize it: the method adjusts the tree composition by adjusting the size of the training set. Two PD metrics, 75PD and 50PD, are obtained by the iterative method. On two independent test sets, 75PD yields higher Matthews correlation coefficient and F1 score values than the PD produced by the original training set, and the differences are statistically significant. All numerical experiments show that the closer the test data are to the training data, the better the predictor's results. These results indicate that the iterative method can optimize the proximity distance definition and that the distance information provided by PD can be used to indicate the reliability of prediction results.
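
The screening idea can be illustrated with a minimal sketch. It assumes the common definition of random forest proximity (the fraction of trees in which two samples land in the same leaf) and takes the distance as one minus that proximity; the abstract does not state the exact formula used. The function names, the use of the mean as the aggregation over training samples, and the threshold value are all hypothetical choices made for illustration. Each sample is represented by its vector of leaf indices, one per tree, such as the matrix returned by scikit-learn's `RandomForestClassifier.apply`.

```python
def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two samples fall into the same leaf."""
    if len(leaves_a) != len(leaves_b) or not leaves_a:
        raise ValueError("leaf vectors must be non-empty and of equal length")
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)


def proximity_distance(leaves_a, leaves_b):
    """Assumed distance definition: 1 - proximity, so 0 means always co-located."""
    return 1.0 - proximity(leaves_a, leaves_b)


def distance_to_training_set(test_leaves, train_leaf_matrix):
    """Mean PD from one test sample to every training sample.

    The mean is an illustrative assumption; the source does not say how
    per-sample distances are aggregated.
    """
    if not train_leaf_matrix:
        raise ValueError("training set must not be empty")
    total = sum(proximity_distance(test_leaves, row) for row in train_leaf_matrix)
    return total / len(train_leaf_matrix)


def screen_test_set(test_leaf_matrix, train_leaf_matrix, threshold):
    """Return indices of test samples whose distance is at most `threshold`."""
    return [
        i
        for i, row in enumerate(test_leaf_matrix)
        if distance_to_training_set(row, train_leaf_matrix) <= threshold
    ]


# Toy leaf indices for a forest of four trees (made-up values).
train_leaves = [[1, 4, 2, 7], [1, 5, 2, 7], [3, 4, 2, 6]]
test_leaves = [[1, 4, 2, 7], [9, 9, 9, 9]]

# Keep only test samples that lie close to the training data.
kept = screen_test_set(test_leaves, train_leaves, threshold=0.5)
```

In this toy example the first test sample shares most of its leaves with the training samples and is kept, while the second shares none and is screened out. Predictions on retained samples would then be the ones treated as more reliable, in line with the observation that closer test data give better results.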
