Abstract

In data mining classification, class imbalance refers to an obvious difference in the number of samples between classes. Most classifiers assume a balanced class distribution or assign equal misclassification costs to all classes, so training directly on imbalanced data degrades classification performance. Oversampling algorithms can restore balance by synthesizing new samples, but the uncontrolled positions of the synthetic samples may aggravate data overlap and further deteriorate classification performance. To tackle this challenge, an improved synthetic minority oversampling technique based on kernel density estimation and neighbor density selection (KDENDS_SMOTE) is proposed in this paper. First, each sample is mapped into a high-dimensional space to avoid choosing a window width and to overcome the limitation of nonlinear separability. Kernel density estimation is then used to derive a density ratio that measures the degree of data overlap. Subsequently, the stability of the density ratio is computed from neighbor information, and a scoring mechanism combining the density ratio with its stability is proposed to assess the fitness of candidate samples. The neighbor density selection based on this scoring mechanism then guides SMOTE to generate new samples within safe, stable regions away from areas of data overlap. Finally, experiments against six state-of-the-art oversampling methods on fifteen real-world datasets show that KDENDS_SMOTE effectively mitigates data overlap and improves classification performance.
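The abstract does not give the full algorithm, but the pipeline it outlines (per-class kernel density estimates, a density ratio as an overlap measure, a neighbor-based stability score, and SMOTE-style interpolation restricted to high-scoring minority samples) can be sketched as follows. The function name, the score combination, and all parameters are illustrative assumptions, and the kernel mapping the abstract mentions is replaced here by a plain Gaussian KDE with an explicit bandwidth for brevity; this is not the authors' exact implementation.

```python
# Minimal, illustrative sketch of the KDENDS_SMOTE idea from the abstract.
# The scoring formula, parameter choices, and the plain Gaussian KDE (with
# an explicit bandwidth, unlike the paper's kernel mapping) are assumptions.
import numpy as np
from sklearn.neighbors import KernelDensity, NearestNeighbors

def kdends_smote(X_min, X_maj, n_new, k=5, bandwidth=1.0, seed=None):
    """Generate n_new synthetic minority samples (hypothetical sketch)."""
    rng = np.random.default_rng(seed)

    # Step 1: kernel density estimates for each class.
    kde_min = KernelDensity(bandwidth=bandwidth).fit(X_min)
    kde_maj = KernelDensity(bandwidth=bandwidth).fit(X_maj)

    # Step 2: density ratio at each minority sample; a low minority-to-
    # majority ratio signals that the sample sits in an overlap region.
    log_ratio = kde_min.score_samples(X_min) - kde_maj.score_samples(X_min)

    # Step 3: stability of the density ratio over the k nearest minority
    # neighbors; low variance means the local region is homogeneous.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)        # idx[:, 0] is the sample itself
    stability = 1.0 / (1.0 + log_ratio[idx[:, 1:]].std(axis=1))

    # Step 4: combined score (assumed form): favor samples that are both
    # far from overlap regions and locally stable, then normalize.
    r = log_ratio - log_ratio.min()
    r /= r.max() + 1e-12
    score = r * stability
    prob = score / score.sum()

    # Step 5: SMOTE-style interpolation, seeded in proportion to the
    # score, so new points land in safe, stable minority regions.
    seeds = rng.choice(len(X_min), size=n_new, p=prob)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for row, i in enumerate(seeds):
        j = idx[i, 1 + rng.integers(k)]  # random minority neighbor
        lam = rng.random()               # interpolation coefficient
        synthetic[row] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic
```

In this sketch, seed samples are drawn with probability proportional to their score, so minority samples in overlapping or unstable regions rarely spawn synthetic points; the paper's actual score combination and selection rule may differ.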
