The imbalance problem: A comparison of sampling approaches using different parameters and feature selection methods in the context of classification

Jose L Morillo‐Salas, Amparo Alonso‐Betanzos, Verónica Bolón‐Canedo

doi:10.1111/exsy.13591

Abstract

A common situation in classification tasks is dealing with unbalanced datasets, an issue that arises when the majority class(es) contain far more samples than the minority class(es). The problem is even more pronounced when the datasets have a large number of features but only a few samples, as is the case with microarray datasets. Traditionally, this problem has been alleviated by applying sampling methods to obtain more balanced classes, either by increasing the number of samples in the minority class (replicating samples or generating new synthetic ones) or by decreasing the number of samples in the majority class. In this study, we compare different balancing methods, including a novel method that applies sampling to both the minority and majority classes. We also explore the benefit of applying feature selection in combination with balancing methods. Based on the results, we recommend a sampling method, a feature selection method, and a classifier to improve classification results according to the type of dataset.
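As a rough illustration of the sampling ideas named in the abstract, the following minimal Python sketch shows random oversampling of a minority class (by replication), random undersampling of a majority class, and a simple combination of the two. It is not the novel method proposed in the paper, whose details are not given in the abstract. The function names, the toy data, and the choice of resizing both classes to the mean of their original sizes are all assumptions made for illustration only.

```python
import random
from collections import Counter


def random_oversample(rows, labels, minority_label, target_count, rng):
    """Replicate randomly chosen minority rows until the class has target_count rows."""
    idx = [i for i, lab in enumerate(labels) if lab == minority_label]
    n_extra = max(target_count - len(idx), 0)
    extra = [rng.choice(idx) for _ in range(n_extra)]  # drawn with replacement
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_undersample(rows, labels, majority_label, target_count, rng):
    """Keep a random subset of at most target_count majority rows; other classes are untouched."""
    idx = [i for i, lab in enumerate(labels) if lab == majority_label]
    kept = set(rng.sample(idx, min(target_count, len(idx))))  # drawn without replacement
    keep = [i for i, lab in enumerate(labels) if lab != majority_label or i in kept]
    return [rows[i] for i in keep], [labels[i] for i in keep]


def combined_resample(rows, labels, minority_label, majority_label, rng):
    """Hypothetical combination: resize both classes to the mean of their original sizes."""
    counts = Counter(labels)
    target = (counts[minority_label] + counts[majority_label]) // 2
    rows, labels = random_oversample(rows, labels, minority_label, target, rng)
    return random_undersample(rows, labels, majority_label, target, rng)


# Toy data (an assumption): 90 majority rows (label 0), 10 minority rows (label 1).
rng = random.Random(0)
rows = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
labels = [0] * 90 + [1] * 10

rows_bal, labels_bal = combined_resample(rows, labels, 1, 0, rng)
balanced_counts = Counter(labels_bal)
print(balanced_counts[0], balanced_counts[1])
```

In practice, resampling of this kind is normally applied to the training split only, so that replicated samples do not leak into the evaluation data.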

Full Text