A Multi-Schematic Classifier-Independent Oversampling Approach for Imbalanced Datasets

Saptarshi Bej, Prashant Srivastava, Kristian Schulz, Olaf Wolkenhauer, Markus Wolfien

doi:10.1109/access.2021.3108450

Abstract

Labelled imbalanced data, used for classification problems, have an unequal distribution of samples over the classes. Traditional classification models, such as random forest and gradient boosting, face problems when dealing with imbalanced datasets. Over 85 oversampling algorithms, mostly extensions of the SMOTE algorithm, have been built over the past two decades to solve the problem of imbalanced datasets. However, previous studies have shown that different oversampling algorithms have different degrees of efficiency with different classifiers. With numerous algorithms available, it is difficult to decide on an oversampling algorithm for a chosen classifier. Here, we overcome this problem with a multi-schematic and classifier-independent oversampling approach, referred to as ProWRAS (Proximity Weighted Random Affine Shadowsampling). ProWRAS integrates the Localized Random Affine Shadowsampling (LoRAS) algorithm and the Proximity Weighted Synthetic oversampling (ProWSyn) algorithm. By controlling the variance of the synthetic samples, and by using a proximity-weighted clustering system for the minority class data, the ProWRAS algorithm improves performance compared to algorithms that generate synthetic samples through modelling high dimensional convex spaces of the minority class. ProWRAS is multi-schematic: it employs four oversampling schemes, each of which models the variance of the generated data in its own way. The proximity-weighted clustering approach of ProWRAS allows one to generate low variance synthetic samples only in borderline clusters, to avoid overlap with the majority class. Most importantly, with a proper choice of oversampling scheme, the performance of ProWRAS is independent of the classifier used. We have benchmarked our newly developed ProWRAS algorithm against five state-of-the-art oversampling models, using four different classifiers, on 20 publicly available datasets.
Our results show that ProWRAS outperforms other oversampling algorithms in a statistically significant way, in terms of both F1-score and $\kappa$-score. Moreover, we have introduced a novel measure for classifier independence, the $\mathcal{J}$-score, and shown quantitatively that ProWRAS performs better, independent of the classifier used. Thus, ProWRAS is highly effective for homogeneous tabular data where convex modelling of the data space can be done. In practice, ProWRAS customizes synthetic sample generation according to a classifier of choice and thereby reduces benchmarking efforts.
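The abstract describes generating synthetic minority samples from "shadowsamples" while controlling their variance. As a rough illustration only, and not the authors' implementation, the sketch below shows one way the general idea could look: noisy copies of the minority points in a neighbourhood are created, and a random convex combination of a few of them becomes one synthetic point. The function name `shadow_affine_sample` and the parameters `n_shadow`, `sigma`, and `n_affine` are hypothetical names chosen for this example.

```python
import random

def shadow_affine_sample(neighbourhood, n_shadow=5, sigma=0.005, n_affine=3, rng=None):
    """Return one synthetic point from a list of minority points (illustrative sketch).

    All names and default values here are assumptions made for this example.
    """
    rng = rng or random.Random()
    # Shadowsamples: copies of each real minority point with Gaussian noise added.
    shadows = [
        [x + rng.gauss(0.0, sigma) for x in point]
        for point in neighbourhood
        for _ in range(n_shadow)
    ]
    # Pick a few shadowsamples to combine.
    chosen = rng.sample(shadows, n_affine)
    # Random convex weights: nonnegative and summing to 1.
    raw = [rng.expovariate(1.0) for _ in range(n_affine)]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(neighbourhood[0])
    # Weighted average of the chosen shadowsamples, coordinate by coordinate.
    return [sum(w * s[d] for w, s in zip(weights, chosen)) for d in range(dim)]

# Usage: one synthetic point from a three-point minority neighbourhood.
neighbourhood = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = shadow_affine_sample(neighbourhood, rng=random.Random(0))
```

In this sketch, averaging more shadowsamples (a larger `n_affine`) tends to pull the synthetic point toward the centre of the neighbourhood and so lowers its variance, while a smaller `sigma` keeps the shadowsamples closer to the original data.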

Highlights

Data originating from real-world problems are often imbalanced.
We observe from our pilot study that, for the k-nearest neighbours (kNN) classifier, Localized Random Affine Shadowsampling (LoRAS), CURE-Synthetic Minority Oversampling Technique (SMOTE), and Polynom-fit SMOTE are the best performers.
Although the oversampling strategy of Polynom-fit SMOTE does not exactly follow any of the strategies we considered for ProWRAS, its use of the star topology is quite similar to the low global variance (LGV) strategy, which in turn is the strategy that ProWRAS uses successfully for most datasets with LR.

Summary

Introduction

Data originating from real-world problems are often imbalanced. Labelled imbalanced data, used for classification problems, have an unequal distribution of samples over the classes. The classes with a larger number of samples are called majority classes, and the classes with a smaller number of samples are called minority classes. Traditional machine-learning-based classification models, such as random forest or gradient boosting, face certain difficulties when dealing with such imbalanced datasets.
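To make the difficulty concrete, here is a small worked example on a made-up dataset with 95 majority and 5 minority labels (the data and helper names are assumptions for illustration, not taken from the paper). A classifier that always predicts the majority class reaches high accuracy, yet the F1-score on the minority class and Cohen's kappa, the two measures named in the abstract, are both zero.

```python
from collections import Counter

def f1_score(y_true, y_pred, positive=1):
    """F1 for the class `positive`; returns 0.0 when there is nothing to count."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    return (p_observed - p_expected) / (1 - p_expected) if p_expected < 1 else 0.0

# Hypothetical imbalanced labels: class 0 is the majority, class 1 the minority.
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_f1 = f1_score(y_true, y_pred, positive=1)
kappa = cohen_kappa(y_true, y_pred)
```

Here `accuracy` is 0.95 while `minority_f1` and `kappa` are both 0, which is why accuracy alone can hide a classifier that never detects the minority class.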

Results

Discussion

Conclusion


Journal: IEEE Access	Publication Date: Jan 1, 2021
Citations: 9	License type: CC BY 4.0


