UFFDFR: Undersampling framework with denoising, fuzzy c-means clustering, and representative sample selection for imbalanced data classification

Ming Zheng, Tong Li, Xiaoyao Zheng, Qingying Yu, Chuanming Chen, Ding Zhou, Changlong Lv, Weiyi Yang

doi:10.1016/j.ins.2021.07.053

Abstract

In the field of artificial intelligence, classification algorithms tend to be biased toward the majority class samples when encountering imbalanced data, resulting in low recognition rates for minority class samples. Undersampling techniques address this issue by decreasing the number of majority class samples to balance the original data distribution before the dataset is learned. However, current clustering-based undersampling methods have limitations that directly affect the original imbalanced dataset and the final classification performance. To address these problems, we propose a novel three-stage undersampling framework with denoising, fuzzy c-means clustering, and representative sample selection (UFFDFR). This framework improves the classification performance on imbalanced data by removing noise and unrepresentative samples from the majority class. Experiments on 15 different imbalanced datasets demonstrate that UFFDFR effectively removed noise and unrepresentative majority class samples and improved classification performance. Furthermore, UFFDFR outperformed three classic and three state-of-the-art clustering-based undersampling methods in terms of F-measure, G-mean, and AUC for five classification algorithms, which was confirmed by the Friedman and Nemenyi post-hoc statistical tests.
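
The three stages named in the abstract can be illustrated with a rough sketch. The abstract does not specify how each stage is carried out, so everything below is an assumption made for illustration and not the authors' procedure: denoising is approximated by a k-nearest-neighbour vote, clustering uses the textbook fuzzy c-means update, and "representative" is taken to mean the highest-membership samples in each cluster. All function names and parameters (`denoise_majority`, `fuzzy_c_means`, `select_representatives`, `undersample`, `k`, `m`, `n_clusters`) are hypothetical.

```python
import numpy as np

def denoise_majority(X, y, majority_label, k=3):
    """Assumed denoising rule: keep a majority sample only if at least
    half of its k nearest neighbours also belong to the majority class."""
    maj_idx = np.flatnonzero(y == majority_label)
    d = np.linalg.norm(X[maj_idx][:, None, :] - X[None, :, :], axis=2)
    d[np.arange(len(maj_idx)), maj_idx] = np.inf  # ignore distance to self
    nn = np.argsort(d, axis=1)[:, :k]
    same = (y[nn] == majority_label).sum(axis=1)
    return maj_idx[same * 2 >= k]

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Textbook fuzzy c-means; returns (centers, membership matrix U)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)  # each row sums to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)  # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def select_representatives(U, n_select):
    """Assumed selection rule: from each cluster take the samples with the
    highest membership, with a quota proportional to the cluster size."""
    labels = U.argmax(axis=1)
    chosen = []
    for j in range(U.shape[1]):
        members = np.flatnonzero(labels == j)
        quota = int(round(n_select * len(members) / len(labels)))
        chosen.extend(members[np.argsort(-U[members, j])[:quota]].tolist())
    return np.array(sorted(chosen), dtype=int)

def undersample(X, y, majority_label, n_clusters=3, k=3):
    """Chain the three stages and return a roughly balanced dataset."""
    kept = denoise_majority(X, y, majority_label, k)
    min_idx = np.flatnonzero(y != majority_label)
    _, U = fuzzy_c_means(X[kept], n_clusters)
    reps = kept[select_representatives(U, len(min_idx))]
    idx = np.concatenate([reps, min_idx])
    return X[idx], y[idx]

# Usage on synthetic data: 200 majority points, 20 minority points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
X_bal, y_bal = undersample(X, y, majority_label=0)
```

Because the per-cluster quotas are rounded independently, the number of retained majority samples only approximately matches the minority count. All minority samples are kept unchanged.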

Full Text