Abstract

<span lang="EN-US">Imbalanced datasets are characterized by a substantially smaller number of data points in the minority class than in the majority class. This imbalance often leads to poor predictive performance of classification models when they are applied in real-world scenarios. There are three main approaches to handling imbalanced data: over-sampling, under-sampling, and hybrid approaches. Over-sampling methods duplicate or synthesize data in the minority class, whereas under-sampling methods remove majority class data. Hybrid methods combine the noise-removing benefits of under-sampling the majority class with the synthetic minority class creation of over-sampling. In this research, we applied principal component analysis (PCA), which is normally used for dimensionality reduction, to reduce the amount of majority class data. The proposed method was compared with eight state-of-the-art under-sampling methods across three classification models: support vector machine, random forest, and AdaBoost. In experiments conducted on 35 datasets, the proposed method achieved higher average values of sensitivity, G-mean, the Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic (ROC) curve than the other under-sampling methods.</span>
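The abstract does not specify how the principal components are used to select which majority-class samples to discard. A minimal sketch of one plausible interpretation, written here with NumPy only (the function name `pca_undersample` and the centroid-distance selection rule are assumptions, not the paper's stated procedure):

```python
import numpy as np

def pca_undersample(X_maj, n_keep, n_components=2):
    """Hypothetical PCA-based under-sampling of the majority class.

    Projects the majority-class samples onto their leading principal
    components and keeps the n_keep samples closest to the class
    centroid in that reduced space. This is one possible selection
    rule, not necessarily the one used in the paper.
    """
    X = np.asarray(X_maj, dtype=float)
    Xc = X - X.mean(axis=0)                 # center the data
    # Principal components via SVD of the centered matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_components].T            # scores in PC space
    dist = np.linalg.norm(Z, axis=1)        # distance from centroid
    return np.argsort(dist)[:n_keep]        # indices of retained rows
```

The retained majority rows would then be concatenated with all minority rows before training the downstream classifier (SVM, random forest, or AdaBoost in the paper's comparison).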
