An in-depth performance analysis of the oversampling techniques for high-class imbalanced dataset

Prasetyo Wibowo, Chastine Fatichah

doi:10.26594/register.v7i1.2206

Open Access

https://doi.org/10.26594/register.v7i1.2206

Journal: Register
Publication Date: Feb 28, 2021
Citations: 8
License type: CC BY-NC-SA 4.0

Affiliation: Sepuluh Nopember Institute of Technology

Abstract

Class imbalance occurs when the majority and minority classes are not equally represented in the data. The degree of imbalance may range from mild to severe. High-class imbalance can distort the overall classification accuracy, since the model is likely to assign most samples to the majority class. Such a model gives biased results, and its performance on the minority class often has little effect on the overall score. Oversampling is one way to deal with high-class imbalance, but only a few oversampling techniques are commonly used to solve data imbalance. This study provides an in-depth performance analysis of oversampling techniques for the high-class imbalance problem. Oversampling balances the amount of data in each class so that the model evaluation is not biased toward the majority class. We compared the performance of the Random Oversampling (ROS), ADASYN, SMOTE, and Borderline-SMOTE techniques. Each oversampling technique is combined with the machine learning methods Random Forest, Logistic Regression, and k-Nearest Neighbor (KNN). The test results show that Random Forest with Borderline-SMOTE gives the best result among all the oversampling techniques, with an accuracy of 0.9997, a precision of 0.9474, a recall of 0.8571, an F1-score of 0.9000, a ROC-AUC of 0.9388, and a PR-AUC of 0.8581.
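
As a rough illustration of what two of the compared techniques do, the sketch below implements random oversampling and a simplified SMOTE-style interpolation in plain Python. It is not the authors' implementation: the function names `random_oversample` and `smote_like`, the toy data, and the neighbour count `k=2` are all assumptions made for this example, and ADASYN and Borderline-SMOTE are not covered.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """ROS sketch: duplicate randomly chosen samples of each smaller class
    until every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))  # exact copy of an existing sample
            y_out.append(label)
    return X_out, y_out


def smote_like(X, y, minority, k=2, seed=0):
    """Simplified SMOTE-style sketch: create synthetic minority samples on the
    line segment between a minority sample and one of its k nearest minority
    neighbours. Assumes at least two minority samples."""
    rng = random.Random(seed)
    minority_pts = [x for x, lab in zip(X, y) if lab == minority]
    n_needed = max(Counter(y).values()) - len(minority_pts)
    X_out, y_out = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority_pts)
        others = [p for p in minority_pts if p is not base]
        # Rank the other minority points by squared Euclidean distance to base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
        y_out.append(minority)
    return X_out, y_out


# Hypothetical toy data: 8 majority samples (label 0), 3 minority samples (label 1).
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
     (0.2, 0.4), (0.4, 0.2), (0.1, 0.4), (0.3, 0.1),
     (2.0, 2.0), (2.2, 2.1), (2.1, 2.3)]
y = [0] * 8 + [1] * 3

X_ros, y_ros = random_oversample(X, y)
X_sm, y_sm = smote_like(X, y, minority=1)

print("original:", Counter(y))
print("after ROS:", Counter(y_ros))
print("after SMOTE-style:", Counter(y_sm))
```

The difference between the two is in the added samples: random oversampling only repeats existing minority points, while the SMOTE-style variant produces new points between existing minority neighbours. In an experiment like the one described, oversampling would normally be applied to the training split only, so that the test data keeps its original class distribution.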

Full Text