A New Under-Sampling Method to Face Class Overlap and Imbalance

Angélica Guzmán-Ponce, José Salvador Sánchez, Rosa María Valdovinos, José Raymundo Marcial-Romero

doi:10.3390/app10155164

Open Access

https://doi.org/10.3390/app10155164

Abstract

Class overlap and class imbalance are two data complexities that challenge the design of effective classifiers in Pattern Recognition and Data Mining, as they may cause a significant loss in performance. Several solutions have been proposed to face both difficulties, but most of them tackle each problem separately. In this paper, we propose a two-stage under-sampling technique that handles class overlap and imbalance simultaneously, with the aim of improving classifier performance. It combines the DBSCAN clustering algorithm, which removes noisy samples and cleans the decision boundary, with a minimum spanning tree algorithm, which addresses the class imbalance. An extensive experimental study on both real-life and synthetic databases shows that the new algorithm performs significantly better than 12 state-of-the-art under-sampling methods when used with three standard classification models (nearest neighbor rule, J48 decision tree, and support vector machine with a linear kernel).
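The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which class DBSCAN is applied to or how the minimum spanning tree selects instances, so the sketch assumes that (a) DBSCAN noise detection is run on the majority class only and (b) the tree is pruned leaf by leaf, shortest edge first, until the majority class matches the minority class in size. All function names and the parameters `eps` and `min_pts` are hypothetical.

```python
import math


def dbscan_noise(points, eps, min_pts):
    """Indices of points that DBSCAN would label as noise.

    A point is a core point if at least min_pts points (itself included)
    lie within eps of it; noise is any non-core point that has no core
    point within eps.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]
    return {
        i for i in range(n)
        if not core[i] and not any(core[j] for j in neighbors[i])
    }


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph: (weight, u, v)."""
    n = len(points)
    in_tree = [False] * n
    best = [(math.inf, -1)] * n  # (cheapest link to the tree, tree endpoint)
    if n:
        best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] >= 0:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v][0]:
                    best[v] = (d, u)
    return edges


def prune_leaves(points, target):
    """Assumed selection rule: drop MST leaves, shortest edge first."""
    adj = {i: {} for i in range(len(points))}
    for w, u, v in mst_edges(points):
        adj[u][v] = w
        adj[v][u] = w
    while len(adj) > max(target, 1):
        leaves = [i for i in adj if len(adj[i]) == 1]
        leaf = min(leaves, key=lambda i: next(iter(adj[i].values())))
        (parent,) = adj[leaf]  # a leaf has exactly one neighbour
        del adj[parent][leaf]
        del adj[leaf]
    return sorted(adj)


def two_stage_undersample(majority, minority, eps, min_pts):
    """Stage 1: remove DBSCAN noise; stage 2: MST-based size reduction."""
    noise = dbscan_noise(majority, eps, min_pts)
    cleaned = [p for i, p in enumerate(majority) if i not in noise]
    kept = prune_leaves(cleaned, len(minority))
    return [cleaned[i] for i in kept]
```

Calling `two_stage_undersample(majority, minority, eps, min_pts)` with lists of coordinate tuples returns a cleaned majority subset no larger than the minority class (or the whole cleaned set if it is already smaller). The all-pairs distance computations make this quadratic in the number of majority instances, which is acceptable for illustration only.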

Highlights

The class imbalance problem is a challenging situation common to many real-world applications such as fraud detection [1], fault/failure diagnosis [2], face recognition [3], text classification [4], sentiment analysis [5], and credit risk prediction [6].
The results reveal that the top four methods were random under-sampling (RUS), evolutionary under-sampling (EUS), EE, and balance cascade (BC), with an average reduction of 91.55%; these are the algorithms that produced perfectly balanced data sets.
The average reductions achieved with the neighborhood cleaning rule (NCL), Tomek links (TL), ENN, one-sided selection (OSS), and SBC were extremely low, which suggests that these algorithms are not the most appropriate for under-sampling.
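To see why methods that yield perfectly balanced data sets also show the largest reductions, consider the following minimal sketch of random under-sampling. The highlights do not define how reduction is computed; the sketch assumes it is the percentage of majority-class instances removed, and the function names are hypothetical.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Keep a random majority subset the same size as the minority class."""
    rng = random.Random(seed)
    k = min(len(majority), len(minority))
    return rng.sample(majority, k)


def reduction_pct(n_before, n_after):
    """Assumed definition: share of majority instances removed, in percent."""
    return 100.0 * (n_before - n_after) / n_before
```

Under this assumed definition, balancing 1000 majority instances against 100 minority instances removes 900 of them, a reduction of 90%. A cleaning method that only discards a few borderline or noisy instances leaves the class sizes almost unchanged and therefore reports a much lower figure.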

Summary

Introduction

The class imbalance problem is a challenging situation common to many real-world applications such as fraud detection [1], fault/failure diagnosis [2], face recognition [3], text classification [4], sentiment analysis [5], and credit risk prediction [6]. A binary data set is said to be imbalanced when one of the classes (the minority or positive class, C+) has significantly fewer instances than the other (the majority or negative class, C−) [7]. This disproportion between positive and negative instances biases classifiers towards the majority class, which may seriously degrade classification performance on the minority class. Many authors have asserted that an imbalanced class distribution does not by itself represent a critical problem for classification, but that, when combined with other data complexity factors, it can significantly decrease classification performance because traditional classifiers tend to err on many positive instances [8]. García et al. [9] observed that the combination of class imbalance and highly overlapping class distributions results in a significant performance deterioration of instance-based classifiers. Class overlap refers to ambiguous regions of the feature space where the prior probability of

Methods

Results

Conclusion