Abstract

Imbalanced learning problems arise when classifiers face data whose samples are unevenly distributed across classes. The Synthetic Minority Oversampling Technique (SMOTE) is a preprocessing method widely used to synthesize new data and balance the number of samples in each class. One family of SMOTE extensions is based on an initial selection approach, which determines the best candidates in the data to be oversampled before synthetic example generation starts. However, SMOTE and most existing oversampling methods based on initial selection still produce overlapping data in the final result, which makes it difficult for any classifier to determine the decision boundary of each class. Therefore, this research proposes a new oversampling technique called Radius-SMOTE, which extends the initial selection approach by creating synthetic data within a safe radius distance. This safe radius distance prevents new synthetic data from overlapping with the opposite class. Radius-SMOTE was evaluated extensively on thirteen imbalanced datasets from the KEEL repository. The experimental results show that the proposed method achieves the best results on 5 datasets, namely yeast-1-4-5-8_vs_7, ecoli-0-1-3-7_vs_2-6, Umbilical cord, Pima, and Haberman, in terms of various assessment metrics. In addition, the computational cost of the proposed method is relatively low, with an average time of 0.5 to 1 second on the 13 tested datasets.

Highlights

  • An imbalanced learning problem is a condition in which sample data are distributed between classes in disproportionate ratios

  • This study describes the steps of the proposed oversampling method, named Radius-SMOTE (Radius Synthetic Minority Oversampling Technique)

  • This study proposes to replace the nearest neighbors parameter in the SMOTE method with a safe radius distance, which is obtained by determining the nearest majority data point to the chosen sample
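The safe-radius idea in the highlights above can be sketched in a few lines: take the distance from a chosen minority sample to its nearest majority-class point, then generate synthetic samples only inside that radius. This is a minimal illustrative sketch, not the authors' full algorithm; the Euclidean metric, the random-direction scheme, and all function names are assumptions made here for illustration.

```python
import numpy as np

def safe_radius(minority_point, majority_points):
    """Safe radius: distance from a minority sample to its nearest
    majority-class sample (Euclidean distance assumed here)."""
    dists = np.linalg.norm(majority_points - minority_point, axis=1)
    return dists.min()

def synthesize_within_radius(minority_point, radius, n_new, seed=None):
    """Generate synthetic samples strictly inside the safe radius, so
    they cannot land in the majority region (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Random unit directions around the chosen minority sample.
    directions = rng.normal(size=(n_new, minority_point.shape[0]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Scale each direction by a random fraction of the safe radius.
    scales = rng.uniform(0.0, radius, size=(n_new, 1))
    return minority_point + directions * scales

# Toy example: one minority point near a cluster of majority points.
minority = np.array([0.0, 0.0])
majority = np.array([[3.0, 0.0], [0.0, 4.0], [5.0, 5.0]])
r = safe_radius(minority, majority)  # nearest majority point is at distance 3
new_samples = synthesize_within_radius(minority, r, n_new=5, seed=42)
```

Because every synthetic point lies within the radius to the nearest majority sample, none of them can fall past that majority point, which is the intuition behind avoiding class overlap.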


Summary

INTRODUCTION

An imbalanced learning problem is a condition in which sample data are distributed between classes in disproportionate ratios. Some of the most popular works overcome this problem with initial approaches based on neighborhood sample selection, such as Borderline-SMOTE [13] and Safe-Level SMOTE [14], or by assigning weights to samples, as in MWMOTE [15] and ADASYN [16]. Although these synthetic oversampling methods have achieved satisfactory results for imbalanced learning, they still possess some deficiencies. Using the KNN method to filter noise at the beginning of the oversampling process selects appropriate minority class data to be sampled. This reduces the introduction of noisy minority samples into areas belonging to the majority class and limits disturbances at the boundary between classes, thereby reducing the overlap between them.
