Abstract

In recent years, class imbalance learning (CIL) has become an important branch of machine learning. The Synthetic Minority Oversampling Technique (SMOTE) is considered a benchmark algorithm among CIL techniques. Although SMOTE performs well on the vast majority of class-imbalance tasks, it has an inherent drawback of noise propagation. Many SMOTE variants have been proposed to address this problem. Generally, the improved solutions conduct a hybrid sampling procedure, i.e., carrying out an undersampling process after SMOTE to remove noise. However, owing to the complexity of data distributions, it is sometimes difficult to accurately identify genuinely noisy instances, resulting in low modeling quality. In this paper, we propose a more robust and universal SMOTE hybrid variant named SMOTE-reverse k-nearest neighbors (SMOTE-RkNN). The proposed algorithm identifies noise based on probability density rather than local neighborhood information. Specifically, the probability density information of each instance is provided by RkNN, a well-known kNN variant. Noisy instances are found and deleted according to their associated probability density. In experiments on 46 class-imbalanced data sets, SMOTE-RkNN showed promising results in comparison with several popular SMOTE hybrid variants.
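The core idea can be illustrated with a minimal sketch. The reverse k-nearest-neighbor (RkNN) count of an instance is the number of other instances that include it among their own k nearest neighbors; a low count indicates a sparse region, which serves as a rough density proxy for flagging noise. The function names, the threshold `min_count`, and the brute-force distance computation below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def rknn_counts(X, k):
    """For each instance, count how many other instances include it
    among their k nearest neighbors (its RkNN cardinality).
    A low count suggests the instance lies in a low-density region."""
    n = len(X)
    # Pairwise squared Euclidean distances (brute force, for illustration)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)  # exclude self from neighbor lists
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        nbrs = np.argsort(d[i])[:k]  # k nearest neighbors of instance i
        counts[nbrs] += 1            # i "votes" for each of its neighbors
    return counts

def filter_noise(X, k=3, min_count=1):
    """Keep instances whose RkNN count exceeds a threshold;
    the rest are treated as noise and removed (hypothetical rule)."""
    counts = rknn_counts(X, k)
    return X[counts > min_count], counts
```

In a hybrid pipeline, such a filter would run on the oversampled data (original plus SMOTE-synthesized instances), discarding points whose RkNN-based density falls below the threshold. For example, a far-off outlier is chosen as a neighbor by no one and receives a count of zero, so it is removed while the dense cluster is kept.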
