Abstract

Background: Imbalanced datasets, in which the number of negative samples is much larger than the number of positive samples, are commonly encountered in bioinformatics classification problems. In particular, data imbalance can cause the performance on the minority class of positive samples to be underestimated. Balancing bioinformatics data is therefore a challenging problem.

Results: In this study, we propose a new data sampling approach, called pseudo-negative sampling, which can be effectively applied when negative samples greatly dominate positive samples. Specifically, we design a supervised learning method based on a max-relevance min-redundancy criterion beyond Pearson correlation coefficient (MMPCC), which is used to choose pseudo-negative samples from the negative samples and treat them as positive samples. In addition, MMPCC uses an incremental search technique to select optimal pseudo-negative samples, which reduces the computation cost. Consequently, the discovered pseudo-negative samples have strong relevance to positive samples and low redundancy with negative ones.

Conclusions: To validate the performance of our method, we conduct experiments based on four UCI datasets and three real bioinformatics datasets. The experimental results show that MMPCC performs better than other sampling methods in terms of Sensitivity, Specificity, Accuracy and the Matthews correlation coefficient. This indicates that pseudo-negative samples are particularly helpful for solving the imbalanced dataset problem. Moreover, the gain in Sensitivity on the minority samples obtained with pseudo-negative samples grows with the improvement of prediction accuracy on all datasets.
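The four evaluation measures named above have standard definitions in terms of confusion-matrix counts. The following is a minimal sketch of those standard formulas, not code from the study; the function name `binary_metrics` and the example counts are hypothetical and chosen only to illustrate why Accuracy alone can hide poor performance on the minority class.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a binary classifier.

    tp/fn count positive (minority) samples, tn/fp count negative samples.
    """
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Matthews correlation coefficient; defined as 0 when the denominator is 0
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical imbalanced case: 20 positives, 980 negatives.
metrics = binary_metrics(tp=5, fp=10, tn=970, fn=15)
print(metrics)
```

In this made-up example Accuracy is 0.975 while Sensitivity is only 0.25, which is the kind of gap that motivates reporting Sensitivity and the Matthews correlation coefficient alongside Accuracy on imbalanced data.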

Highlights

  • Imbalanced datasets, in which the number of negative samples is much larger than that of positive samples, are commonly encountered in bioinformatics classification problems

  • We propose the concept of pseudo-negative samples and present a pseudo-negative sampling method based on the max-relevance and min-redundancy Pearson correlation coefficient in supervised learning

  • We propose a pseudo-negative sampling algorithm based on max-relevance and min-redundancy over the Pearson correlation coefficient, called the max-relevance min-redundancy criterion beyond Pearson correlation coefficient (MMPCC)
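The selection idea in the highlights can be sketched as a greedy, incremental search. This is only an illustration under stated assumptions, not the authors' MMPCC implementation: relevance is taken here as the mean Pearson correlation between a negative candidate and the current positive set, redundancy as its mean Pearson correlation with the other remaining negatives, and the score as relevance minus redundancy. The function names `pearson_rows` and `select_pseudo_negatives`, the scoring rule, and the toy data are all hypothetical.

```python
import numpy as np


def pearson_rows(a, b):
    """Pearson correlation between every row of a and every row of b.

    Assumes no row is constant (a constant row has zero variance).
    """
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def select_pseudo_negatives(pos, neg, k):
    """Greedily pick k rows of neg (k <= len(neg)) as pseudo-negatives.

    Returns the chosen row indices of neg, in selection order.
    """
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    remaining = list(range(len(neg)))
    chosen = []
    for _ in range(k):
        cand = neg[remaining]
        # Assumed relevance: mean correlation with the current positive set
        relevance = pearson_rows(cand, pos).mean(axis=1)
        # Assumed redundancy: mean correlation with the other remaining
        # negatives, excluding the self-correlation of 1 on the diagonal
        corr = pearson_rows(cand, cand)
        redundancy = (corr.sum(axis=1) - 1.0) / max(len(remaining) - 1, 1)
        best = int(np.argmax(relevance - redundancy))
        idx = remaining.pop(best)
        chosen.append(idx)
        # Incremental step: the selected sample joins the positive set
        pos = np.vstack([pos, neg[idx]])
    return chosen


# Toy data: rows are samples, columns are features.
positives = [[1, 2, 3, 4], [2, 4, 6, 8.5]]
negatives = [[4, 3, 2, 1], [1, 2, 3, 5], [5, 3, 2, 0], [3, 1, 4, 1]]
print(select_pseudo_negatives(positives, negatives, k=1))
```

Because only one sample is added per iteration and scores are recomputed against the updated sets, the search avoids evaluating all subsets of negatives, which is the sense in which an incremental search reduces computation.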

Summary

Introduction

Imbalanced datasets, in which the number of negative samples is much larger than that of positive samples, are commonly encountered in bioinformatics classification problems. Data imbalance can cause the performance on the minority class of positive samples to be underestimated. This work is motivated by a real-world requirement in bioinformatics data processing: negative samples very commonly dominate positive samples, a phenomenon known as the data imbalance problem. Genetic data mining is difficult to carry out with limited positive samples, and without enough positive samples biologists cannot perform experiments. Some positive samples cannot be identified, or are categorised as negative samples; these can be defined as pseudo-negative samples. Selecting these pseudo-negative samples offers an alternative way to address the imbalanced data problem in bioinformatics.
