Abstract

Class imbalance is pervasive in bioinformatics prediction problems, in which the number of majority samples is significantly larger than the number of minority samples. Reducing the severity of class imbalance has been shown to be a promising route for enhancing the prediction performance of a statistical machine learning-based predictor under an imbalanced learning scenario. In this study, we propose a novel dynamic query-driven sample rescaling (DQD-SR) strategy for addressing class imbalance. Unlike the traditional sample rescaling technique, which often yields a fixed balanced dataset, the proposed DQD-SR dynamically generates a query-driven balanced dataset based on the k-nearest neighbors (KNN) algorithm. A prediction model trained on a traditional sample rescaling (T-SR)-derived balanced dataset will partially learn the global knowledge buried in the original dataset, whereas a prediction model trained on a DQD-SR-derived balanced dataset will reflect the query-specific local knowledge between a query sample and its correlated neighbors in the original dataset. We therefore developed an ensemble scheme that integrates the T-SR-based model and the DQD-SR-based model to further improve the overall prediction performance. To demonstrate the efficacy of the proposed method, we performed stringent cross-validation and independent validation tests on benchmark datasets for protein–nucleotide binding residue prediction, which is a typical imbalanced learning problem in bioinformatics. Experimental results show that the proposed method achieves high prediction performance and outperforms existing sequence-based protein–nucleotide binding residue predictors. We also implemented a predictor called TargetNUCs, which is freely available for academic use at http://csbio.njust.edu.cn/bioinf/TargetNUCs.
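The contrast between T-SR and DQD-SR can be illustrated with a minimal Python sketch. It is not the TargetNUCs implementation: the abstract does not specify the distance metric, the number of neighbors retained, the base classifier, or how the two models are combined. The sketch assumes Euclidean distance, random undersampling as the T-SR step, a majority subset equal in size to the minority class, a simple KNN vote as a stand-in classifier, and a weighted average as the ensemble. All function names and the `weight` parameter are hypothetical.

```python
import math
import random


def traditional_rescale(majority, minority, seed=0):
    """T-SR stand-in (assumption): random undersampling of the majority class.

    The resulting balanced dataset is fixed and shared by every query.
    """
    rng = random.Random(seed)
    n_keep = min(len(minority), len(majority))
    return rng.sample(majority, k=n_keep), list(minority)


def query_driven_rescale(query, majority, minority):
    """DQD-SR stand-in (assumption): keep the majority samples nearest the query.

    The resulting balanced dataset is rebuilt for each query sample.
    """
    n_keep = min(len(minority), len(majority))
    nearest = sorted(majority, key=lambda x: math.dist(query, x))[:n_keep]
    return nearest, list(minority)


def knn_score(query, negatives, positives, k=3):
    """Stand-in classifier: fraction of the k nearest samples that are positive."""
    labelled = [(math.dist(query, x), 0) for x in negatives]
    labelled += [(math.dist(query, x), 1) for x in positives]
    labelled.sort(key=lambda pair: pair[0])
    top = labelled[:k]
    return sum(label for _, label in top) / len(top)


def ensemble_score(query, majority, minority, weight=0.5, k=3, seed=0):
    """Weighted average of the global (T-SR) and local (DQD-SR) scores.

    The weighting rule is an assumption made for illustration only.
    """
    global_neg, pos = traditional_rescale(majority, minority, seed)
    local_neg, _ = query_driven_rescale(query, majority, minority)
    s_global = knn_score(query, global_neg, pos, k)
    s_local = knn_score(query, local_neg, pos, k)
    return weight * s_global + (1 - weight) * s_local


# Toy data: ten majority (non-binding) samples and two minority (binding) samples.
majority = [(float(i), 0.0) for i in range(10)]
minority = [(0.5, 1.0), (1.5, 1.0)]
query = (9.0, 0.0)

local_neg, pos = query_driven_rescale(query, majority, minority)
print(local_neg)  # the two majority samples closest to the query
print(ensemble_score(query, majority, minority))
```

In this sketch the T-SR subset is the same for every query, whereas the DQD-SR subset changes with the query, so the second score depends only on the neighborhood of the sample being predicted.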

