Abstract

Real-world classification domains, such as medicine, health and safety, and finance, often exhibit imbalanced class priors and asymmetric misclassification costs. In such cases, the classification model must achieve high recall without significantly impacting precision. Resampling the training data is the standard approach to improving classification performance on imbalanced binary data. However, the state-of-the-art methods ignore the local joint distribution of the data or correct it as a post-processing step. This can cause sub-optimal shifts in the training distribution, particularly when the target data distribution is complex. In this paper, we propose Radial-Based Combined Cleaning and Resampling (RB-CCR). RB-CCR utilizes the concept of class potential to refine the energy-based resampling approach of CCR. In particular, RB-CCR exploits the class potential to accurately locate sub-regions of the data-space for synthetic oversampling. The category sub-region for oversampling can be specified as an input parameter to meet domain-specific needs or be automatically selected via cross-validation. Our 5×2 cross-validated results on 57 benchmark binary datasets with 9 classifiers show that RB-CCR achieves a better precision-recall trade-off than CCR and generally outperforms the state-of-the-art resampling methods in terms of AUC and G-mean.
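As a rough illustration of the class-potential idea mentioned above (a sketch, not the authors' implementation), the potential of a point can be estimated as a sum of Gaussian radial-basis contributions from the samples of one class; regions of high potential are dominated by that class. The function name and the spread parameter `gamma` here are illustrative assumptions.

```python
import numpy as np

def class_potential(x, class_samples, gamma=0.5):
    """Gaussian radial-basis class potential at point x.

    Sums an RBF contribution from every sample of one class;
    larger values mean x lies in a region dominated by that class.
    `gamma` (illustrative) controls the spread of each contribution.
    """
    dists = np.linalg.norm(class_samples - x, axis=1)
    return np.sum(np.exp(-(dists / gamma) ** 2))

# Toy example: three clustered samples of one class.
samples = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
near = class_potential(np.array([0.05, 0.05]), samples)  # inside the cluster
far = class_potential(np.array([2.0, 2.0]), samples)     # far from the cluster
```

A resampler can compare such potential values across candidate sub-regions to decide where synthetic minority samples would be least contaminated by the majority class.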

Highlights

  • Machine learning classifiers are quickly becoming a tool of choice in application areas ranging from finance to robotics and medicine

  • We considered classification with a total of 9 different algorithms: CART decision tree, k-nearest neighbors classifier (KNN), support vector machine with linear (L-SVM), RBF (R-SVM) and polynomial (P-SVM) kernels, logistic regression (LR), Naive Bayes (NB), and multi-layer perceptron with ReLU (R-MLP) and linear (L-MLP) activation functions in the hidden layer

  • We proposed the Radial-Based Combined Cleaning and Resampling algorithm (RB-CCR)


Introduction

Machine learning classifiers are quickly becoming a tool of choice in application areas ranging from finance to robotics and medicine. This is largely owing to the growth in the availability of labeled training data and declining computing costs. Many of the most important domains, such as those related to health and safety, are limited by the problem of class imbalance. The induction of binary classifiers on imbalanced training data results in a predictive bias toward the majority class and has been associated with poor performance during application (Branco et al., 2016). Detailed empirical studies have demonstrated that class imbalance exacerbates the difficulty of learning accurate predictive models from complex data involving class overlap, sub-concepts, non-parametric distributions, etc. (He & Garcia, 2009; Stefanowski, 2016).

