Abstract

In data mining classification, class imbalance refers to an obvious difference in the number of samples between classes. Most classifiers assume a balanced class distribution or assign equal misclassification costs to all classes, so training directly on imbalanced data degrades classification performance. Oversampling algorithms can restore balance by synthesizing new samples, but the uncontrolled positions of the synthetic samples may aggravate data overlap and further deteriorate classification performance. To tackle this challenge, an improved synthetic minority oversampling technique based on kernel density estimation and neighbor density selection (KDENDS_SMOTE) is proposed in this paper. First, each sample is mapped into a high-dimensional space to avoid choosing a window width and to overcome the limitation of nonlinear separability. Kernel density estimation is then used to derive a density ratio that measures the degree of data overlap. Subsequently, the stability of the density ratio is computed from neighbor information, and a scoring mechanism combining the density ratio with its stability is proposed to assess the fitness of candidate samples. The neighbor density selection based on this scoring mechanism then guides SMOTE to generate new samples within safe, stable regions away from areas of data overlap. Finally, experiments against six state-of-the-art oversampling methods on fifteen real-world datasets show that KDENDS_SMOTE effectively mitigates data overlap and improves classification performance.
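The abstract does not give the full algorithm, but the pipeline it outlines (per-class kernel density estimates, a density ratio as an overlap measure, a neighbor-based stability score, and SMOTE-style interpolation restricted to high-scoring minority samples) can be sketched as follows. The function name, the score combination, and all parameters are illustrative assumptions, and the kernel mapping the abstract mentions is replaced here by a plain Gaussian KDE with an explicit bandwidth for brevity; this is not the authors' exact implementation.

```python
# Minimal, illustrative sketch of the KDENDS_SMOTE idea from the abstract.
# The scoring formula, parameter choices, and the plain Gaussian KDE (with
# an explicit bandwidth, unlike the paper's kernel mapping) are assumptions.
import numpy as np
from sklearn.neighbors import KernelDensity, NearestNeighbors

def kdends_smote(X_min, X_maj, n_new, k=5, bandwidth=1.0, seed=None):
    """Generate n_new synthetic minority samples (hypothetical sketch)."""
    rng = np.random.default_rng(seed)

    # Step 1: kernel density estimates for each class.
    kde_min = KernelDensity(bandwidth=bandwidth).fit(X_min)
    kde_maj = KernelDensity(bandwidth=bandwidth).fit(X_maj)

    # Step 2: density ratio at each minority sample; a low minority-to-
    # majority ratio signals that the sample sits in an overlap region.
    log_ratio = kde_min.score_samples(X_min) - kde_maj.score_samples(X_min)

    # Step 3: stability of the density ratio over the k nearest minority
    # neighbors; low variance means the local region is homogeneous.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)        # idx[:, 0] is the sample itself
    stability = 1.0 / (1.0 + log_ratio[idx[:, 1:]].std(axis=1))

    # Step 4: combined score (assumed form): favor samples that are both
    # far from overlap regions and locally stable, then normalize.
    r = log_ratio - log_ratio.min()
    r /= r.max() + 1e-12
    score = r * stability
    prob = score / score.sum()

    # Step 5: SMOTE-style interpolation, seeded in proportion to the
    # score, so new points land in safe, stable minority regions.
    seeds = rng.choice(len(X_min), size=n_new, p=prob)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for row, i in enumerate(seeds):
        j = idx[i, 1 + rng.integers(k)]  # random minority neighbor
        lam = rng.random()               # interpolation coefficient
        synthetic[row] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic
```

In this sketch, seed samples are drawn with probability proportional to their score, so minority samples in overlapping or unstable regions rarely spawn synthetic points; the paper's actual score combination and selection rule may differ.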
