Abstract

The synthetic minority oversampling technique (SMOTE) algorithm is considered a benchmark algorithm for addressing the class imbalance learning (CIL) problem. However, SMOTE neither observes the distribution of the training data nor explores its internal structure, resulting in unstable and non-robust classification results. Recently, more than 100 SMOTE variants have been developed to address this problem. Most of them attempt to directly explore the prior distribution information of the training data, which may provide highly inaccurate guidance in some classification scenarios. In this study, we present the instance weighted SMOTE (IW-SMOTE) algorithm, a more robust and universal improvement of SMOTE that exploits distribution information indirectly. Specifically, an UnderBagging-like undersampling ensemble algorithm that uses classification and regression tree (CART) as the base classifier is first adopted to classify each training instance and acquire its confusing information, i.e., how often it is misclassified. Based on this confusing information, we can accurately estimate the location of each instance, namely whether it is noisy, borderline, or safe. The noisy instances can then be removed, and the borderline instances can be given more chances than the safe instances to serve as seed instances in the SMOTE procedure. Finally, the balanced instance set is used to train CART, K-nearest neighbors (KNN) and support vector machine (SVM) classifiers to verify that the proposed algorithm is independent of any specific classification model. We compare IW-SMOTE with several state-of-the-art SMOTE-based algorithms on numerous class-imbalanced data sets, and IW-SMOTE shows promising results.
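
To make the pipeline concrete, the sketch below implements the three stages the abstract outlines: an UnderBagging-like CART ensemble to collect per-instance misclassification rates, a noise/borderline/safe split of the minority class, and weighted SMOTE seeding. It is a minimal illustration, not the paper's implementation; the thresholds `noise_thr` and `border_thr`, and the 2:1 borderline-versus-safe seed weighting, are assumed values chosen for demonstration.

```python
# Hypothetical sketch of the IW-SMOTE pipeline; thresholds and the
# borderline/safe weighting are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier


def iw_smote(X, y, minority=1, n_trees=10, k=5,
             noise_thr=0.8, border_thr=0.3, seed=0):
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)

    # 1) UnderBagging-like ensemble: each CART tree trains on all
    #    minority instances plus an equal-size random majority subset,
    #    then predicts every instance to accumulate "confusing" counts.
    miss = np.zeros(len(X))
    for _ in range(n_trees):
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        train = np.concatenate([min_idx, sub])
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1 << 31)))
        tree.fit(X[train], y[train])
        miss += tree.predict(X) != y
    err = miss / n_trees  # per-instance misclassification rate

    # 2) Locate minority instances by error rate: noisy points are
    #    dropped; borderline points get a larger seed weight than safe.
    m_err = err[min_idx]
    keep = min_idx[m_err < noise_thr]
    w = np.where(err[keep] >= border_thr, 2.0, 1.0)  # assumed 2:1 ratio
    w /= w.sum()

    # 3) SMOTE over the kept minority points, drawing seeds by weight
    #    and interpolating toward a random minority neighbor.
    X_keep = X[keep]
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_keep))).fit(X_keep)
    n_new = len(maj_idx) - len(keep)
    seeds = rng.choice(len(X_keep), size=n_new, p=w)
    synth = []
    for s in seeds:
        nbrs = nn.kneighbors(X_keep[s:s + 1], return_distance=False)[0][1:]
        target = X_keep[rng.choice(nbrs)]
        synth.append(X_keep[s] + rng.random() * (target - X_keep[s]))
    X_bal = np.vstack([X, np.asarray(synth)])
    y_bal = np.concatenate([y, np.full(n_new, minority)])
    return X_bal, y_bal
```

The returned balanced set can then be fed to CART, KNN, or SVM classifiers, mirroring the model-agnostic evaluation the abstract describes.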
