Abstract
Oversampling is a popular and useful method for the binary classification of imbalanced data. However, many existing oversampling methods are likely to generate redundant, unsafe, or noisy samples, primarily because they do not adequately consider the data distribution. To address this issue, we propose a novel oversampling approach, namely Switching Synthesizing-Incorporated and Cluster-Based Synthetic Oversampling (SSI-CBSO). The core idea of SSI-CBSO is four-fold: (1) noise samples are removed using a K-nearest-neighbor strategy, and Fuzzy C-Means clustering is applied to the filtered minority-class data; (2) the number of samples to be synthesized is adaptively assigned to each cluster according to the inter-class distance and the intra-cluster similarity; (3) to better reflect the data distribution, a new method based on the concept of the hypersphere is put forward to measure cluster density in high-dimensional space; and (4) a new principle based on the Mahalanobis distance is provided for better selection of the target sample. A switching synthesizing strategy is then established to guarantee the safety of the synthesized samples. Finally, experiments on 13 binary imbalanced data sets, using five evaluation metrics and four classifiers, verify that the proposed SSI-CBSO approach obtains desirable results.
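The abstract does not spell out the noise-removal rule, so the following is only a minimal sketch of one common K-nearest-neighbor convention: a minority sample is treated as noise when none of its k nearest neighbors belongs to the minority class. The function name `filter_minority_noise`, the default `k`, and the toy data are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math

def filter_minority_noise(minority, majority, k=3):
    """Keep minority points that have at least one minority point among their k nearest neighbours."""
    kept = []
    for i, x in enumerate(minority):
        # Candidate neighbours: every other sample, tagged with its class.
        neighbours = [(math.dist(x, m), "min")
                      for j, m in enumerate(minority) if j != i]
        neighbours += [(math.dist(x, m), "maj") for m in majority]
        neighbours.sort(key=lambda pair: pair[0])
        nearest = neighbours[:k]
        # Assumed rule (not confirmed by the source): a point is noise
        # when all of its k nearest neighbours are majority samples.
        if any(label == "min" for _, label in nearest):
            kept.append(x)
    return kept

# Toy data: three clustered minority points and one minority point
# sitting inside the majority region.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
majority = [(5.0, 5.1), (5.1, 5.0), (4.9, 5.0), (5.0, 4.9), (4.0, 4.0)]
filtered = filter_minority_noise(minority, majority, k=3)
```

In this toy case the isolated minority point at (5.0, 5.0) is surrounded by majority samples and is dropped, while the three clustered minority points are kept for the subsequent clustering step.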