Abstract

Labelled imbalanced data, used for classification problems, have an unequal distribution of samples over the classes. Traditional classification models, such as random forest or gradient boosting, struggle with imbalanced datasets. Over 85 oversampling algorithms, mostly extensions of the SMOTE algorithm, have been built over the past two decades to address this problem. However, previous studies have shown that different oversampling algorithms have different degrees of efficiency with different classifiers. With so many algorithms available, it is difficult to decide on an oversampling algorithm for a chosen classifier. Here, we overcome this problem with a multi-schematic and classifier-independent oversampling approach, referred to as ProWRAS (Proximity Weighted Random Affine Shadowsampling). ProWRAS integrates the Localized Random Affine Shadowsampling (LoRAS) algorithm and the Proximity Weighted Synthetic oversampling (ProWSyn) algorithm. By controlling the variance of the synthetic samples and by using a proximity-weighted clustering system for the minority class data, ProWRAS improves performance compared to algorithms that generate synthetic samples by modelling high-dimensional convex spaces of the minority class. ProWRAS is multi-schematic in that it employs four oversampling schemes, each of which models the variance of the generated data in its own way. The proximity-weighted clustering approach of ProWRAS allows one to generate low-variance synthetic samples only in borderline clusters, to avoid overlap with the majority class. Most importantly, the performance of ProWRAS, with a proper choice of oversampling scheme, is independent of the classifier used. We have benchmarked ProWRAS against five state-of-the-art oversampling models, using four different classifiers, on 20 publicly available datasets.
Our results show that ProWRAS outperforms the other oversampling algorithms in a statistically significant way, in terms of both F1-score and $\kappa$-score. Moreover, we introduce a novel measure for classifier independence, the $\mathcal{J}$-score, and show quantitatively that ProWRAS performs better, independent of the classifier used. Thus, ProWRAS is highly effective for homogeneous tabular data where convex modelling of the data space is possible. In practice, ProWRAS customizes synthetic sample generation according to a classifier of choice and thereby reduces benchmarking effort.
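To make the idea of modelling the minority class through convex combinations more concrete, the following is a minimal sketch and not the ProWRAS implementation. It draws each synthetic sample as a random convex combination of a minority point and its nearest minority neighbours, optionally adding Gaussian noise first to mimic shadowsamples. The function name `convex_oversample` and the parameters `k`, `sigma`, and `seed` are hypothetical and chosen only for illustration; the proximity-weighted clustering and the four oversampling schemes are not modelled here.

```python
import numpy as np

def convex_oversample(minority, n_new, k=3, sigma=0.0, seed=0):
    """Illustrative only: random convex combinations of minority neighbourhoods.

    minority : array of shape (n, d) with the minority class samples
    n_new    : number of synthetic samples to generate
    k        : neighbourhood size (assumed parameter, includes the point itself)
    sigma    : std. dev. of Gaussian noise used to create "shadowsamples"
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n, d = minority.shape
    k = min(k, n)

    # Pairwise Euclidean distances; each row's k smallest entries
    # give the neighbourhood of that point (the point itself included).
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, d))
    for i in range(n_new):
        centre = rng.integers(n)
        points = minority[neighbours[centre]]
        if sigma > 0:
            # Perturb the neighbourhood to obtain noisy shadowsamples.
            points = points + rng.normal(0.0, sigma, size=points.shape)
        # Dirichlet weights are non-negative and sum to 1,
        # so the result is a convex combination of the chosen points.
        weights = rng.dirichlet(np.ones(k))
        synthetic[i] = weights @ points
    return synthetic

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = convex_oversample(X_min, n_new=10, k=3)
```

With `sigma=0.0`, every synthetic point lies inside the convex hull of its neighbourhood; a smaller neighbourhood or smaller noise gives lower-variance samples, which is the kind of control the abstract refers to.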

Highlights

  • Data originating from real-world problems are often imbalanced

  • Our pilot study shows that, for the k-nearest neighbours (kNN) classifier, Localized Random Affine Shadowsampling (LoRAS), CURE-Synthetic Minority Oversampling Technique (SMOTE), and Polynom-fit SMOTE are the best performers

  • Although the oversampling strategy of Polynom-fit SMOTE does not exactly follow any of the strategies we considered for ProWRAS, its use of the star topology is quite similar to the low global variance (LGV) strategy, which in turn is the strategy ProWRAS uses successfully on most datasets for LR


Introduction

Data originating from real-world problems are often imbalanced. Labelled imbalanced data, used for classification problems, have an unequal distribution of samples over the classes. Classes with more samples are called majority classes, and classes with fewer samples are called minority classes. Traditional machine learning classification models, such as random forest or gradient boosting, face difficulties when dealing with such imbalanced datasets.
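As a small illustration of these definitions, the sketch below counts the samples per class and reports the majority class, the minority class, and the ratio between their sizes. The helper name `class_distribution` and the term "imbalance ratio" are assumptions introduced here for illustration, not taken from the paper.

```python
from collections import Counter

def class_distribution(labels):
    """Summarise how samples are spread over the classes (illustrative helper)."""
    counts = Counter(labels)
    # Class with the most samples and class with the fewest samples.
    majority_label, majority_count = max(counts.items(), key=lambda kv: kv[1])
    minority_label, minority_count = min(counts.items(), key=lambda kv: kv[1])
    return {
        "counts": dict(counts),
        "majority": majority_label,
        "minority": minority_label,
        # Assumed definition: majority class size divided by minority class size.
        "imbalance_ratio": majority_count / minority_count,
    }

labels = [0] * 90 + [1] * 10
summary = class_distribution(labels)
print(summary["majority"], summary["minority"], summary["imbalance_ratio"])
```

A ratio close to 1 indicates a roughly balanced dataset, while larger values indicate stronger imbalance.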

