Abstract

In recent years, class imbalance learning (CIL) has become an important branch of machine learning. The Synthetic Minority Oversampling Technique (SMOTE) is considered a benchmark algorithm among CIL techniques. Although SMOTE performs well on the vast majority of class-imbalance tasks, it has an inherent drawback of noise propagation. Many SMOTE variants have been proposed to address this problem. Generally, the improved solutions conduct a hybrid sampling procedure, i.e., carrying out an undersampling process after SMOTE to remove noise. However, owing to the complexity of data distributions, it is sometimes difficult to accurately identify genuinely noisy instances, resulting in low modeling quality. In this paper, we propose a more robust and universal SMOTE hybrid variant named SMOTE-reverse k-nearest neighbors (SMOTE-RkNN). The proposed algorithm identifies noise based on probability density rather than local neighborhood information. Specifically, the probability density information of each instance is provided by RkNN, a well-known kNN variant. Noisy instances are found and deleted according to their associated probability density. In experiments on 46 class-imbalanced data sets, SMOTE-RkNN showed promising results in comparison with several popular SMOTE hybrid variants.
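The core idea can be illustrated with a minimal sketch. The reverse k-nearest-neighbor (RkNN) count of an instance is the number of other instances that include it among their own k nearest neighbors; a low count indicates a sparse region, which serves as a rough density proxy for flagging noise. The function names, the threshold `min_count`, and the brute-force distance computation below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def rknn_counts(X, k):
    """For each instance, count how many other instances include it
    among their k nearest neighbors (its RkNN cardinality).
    A low count suggests the instance lies in a low-density region."""
    n = len(X)
    # Pairwise squared Euclidean distances (brute force, for illustration)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)  # exclude self from neighbor lists
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        nbrs = np.argsort(d[i])[:k]  # k nearest neighbors of instance i
        counts[nbrs] += 1            # i "votes" for each of its neighbors
    return counts

def filter_noise(X, k=3, min_count=1):
    """Keep instances whose RkNN count exceeds a threshold;
    the rest are treated as noise and removed (hypothetical rule)."""
    counts = rknn_counts(X, k)
    return X[counts > min_count], counts
```

In a hybrid pipeline, such a filter would run on the oversampled data (original plus SMOTE-synthesized instances), discarding points whose RkNN-based density falls below the threshold. For example, a far-off outlier is chosen as a neighbor by no one and receives a count of zero, so it is removed while the dense cluster is kept.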
