Many industrial applications, such as credit card fraud detection (CCFD) and defective part identification, involve imbalanced classification, in which minority samples are substantially less common than majority samples. Classifier performance tends to suffer from noisy samples in the majority or minority classes. This work proposes a new undersampling scheme for imbalanced classification, called the clustering-based noisy-sample-removed undersampling scheme (NUS). First, the majority class samples are clustered. Taking each cluster center as the center of a hypersphere, the distance from that center to its farthest majority sample in the cluster is used as the radius. The Euclidean distance between each minority sample and a cluster center then determines whether the sample lies inside the hypersphere; samples that do are treated as noisy and excluded from the minority class. Noisy majority samples are removed by the same procedure with the roles of the two classes exchanged. Second, we propose NUS, which combines this noisy-sample removal with undersampling. Finally, to demonstrate the effectiveness of NUS, we integrate it with three basic classifiers: random forest (RF), decision tree (DT), and logistic regression (LR), and compare the resulting models with seven undersampling, oversampling, and noisy-sample-removal methods. Experiments are conducted on 13 public datasets and three real e-commerce transaction datasets. The results show that NUS improves the performance of existing classifiers.
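The following is a minimal sketch of the noise-removal and undersampling steps as described in this abstract; it assumes k-means clustering and simple random undersampling, and the function names (`remove_noisy`, `nus`), the cluster count, and the final balancing step are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def remove_noisy(reference, candidates, n_clusters=5, random_state=0):
    """Cluster the `reference` class; a `candidates` sample falling inside any
    cluster hypersphere (radius = distance from the center to its farthest
    reference member) is treated as noisy. Returns a keep-mask for candidates."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(reference)
    keep = np.ones(len(candidates), dtype=bool)
    for k, center in enumerate(km.cluster_centers_):
        members = reference[km.labels_ == k]
        radius = np.max(np.linalg.norm(members - center, axis=1))
        dists = np.linalg.norm(candidates - center, axis=1)
        keep &= dists > radius  # inside the hypersphere -> considered noisy
    return keep

def nus(X_maj, X_min, n_clusters=5, random_state=0):
    """NUS sketch: (1) drop minority samples inside majority-cluster hyperspheres,
    (2) drop majority samples inside minority-cluster hyperspheres (roles swapped),
    (3) randomly undersample the cleaned majority class to the minority size
    (assumed balancing step)."""
    rng = np.random.default_rng(random_state)
    X_min_clean = X_min[remove_noisy(X_maj, X_min, n_clusters, random_state)]
    X_maj_clean = X_maj[remove_noisy(X_min_clean, X_maj, n_clusters, random_state)]
    n_keep = min(len(X_min_clean), len(X_maj_clean))
    idx = rng.choice(len(X_maj_clean), size=n_keep, replace=False)
    return X_maj_clean[idx], X_min_clean
```

The cleaned, balanced output can then be fed to any base classifier (e.g., RF, DT, or LR) as in the experiments described above.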