Abstract

Imbalanced learning is a common problem in data mining. In imbalanced datasets, data samples are unevenly distributed across classes, which poses a challenge for standard algorithms designed for balanced class distributions. Although various strategies address this problem, generating artificial data to achieve a relatively balanced class distribution is more general than directly modifying specific classification algorithms, since the oversampled data can be combined with any user-specified classifier without restriction. In this paper, we present a novel oversampling method, the Global Data Distribution Weighted Synthetic Oversampling Technique (GDDSYN). By applying clustering, GDDSYN refines the criteria for selecting the minority-class samples used to generate synthetic samples, thereby avoiding the generation of additional noise samples. It assigns weights to the number of synthetic samples according to how informative each sample is and how sparse the cluster it belongs to is, tackling within-class and between-class imbalance simultaneously. Scores based on the Silhouette Coefficient and Mutual Information help the k-means algorithm choose a reasonable number of clusters for the minority and majority classes, respectively, so that clustering quality is maintained. The clustering information is then used to improve the generation path of synthetic samples and avoid class overlap. GDDSYN has been evaluated extensively on 10 artificial and 10 real-world datasets. The empirical results show that, in terms of the assessment metrics, our method outperforms or is comparable with existing methods when the artificial data generated by GDDSYN are used.
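The following is a minimal, hypothetical sketch (not the authors' reference implementation) of the general idea described above: cluster the minority class with k-means, choose the number of clusters by silhouette score, allocate more synthetic samples to sparser clusters, and interpolate new samples along paths between neighbours within the same cluster. All function names and parameters here are illustrative assumptions.

```python
# Hypothetical sketch of clustered, sparsity-weighted oversampling.
# Assumes numpy and scikit-learn; not the GDDSYN reference code.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors


def cluster_minority(X_min, k_candidates=range(2, 6), random_state=0):
    """Cluster minority samples; choose k with the best silhouette score."""
    best_k, best_score, best_labels = 1, -1.0, np.zeros(len(X_min), dtype=int)
    for k in k_candidates:
        if k >= len(X_min):
            break
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X_min)
        score = silhouette_score(X_min, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


def oversample(X_min, n_new, k_candidates=range(2, 6), random_state=0):
    """Generate n_new synthetic minority samples, giving sparse clusters more."""
    rng = np.random.default_rng(random_state)
    _, labels = cluster_minority(X_min, k_candidates, random_state)

    # Sparsity weight: clusters whose members lie farther from the centroid
    # receive a larger share of the synthetic samples.
    sparsity = {}
    for c in np.unique(labels):
        Xc = X_min[labels == c]
        centroid = Xc.mean(axis=0)
        sparsity[c] = np.mean(np.linalg.norm(Xc - centroid, axis=1)) + 1e-12
    total = sum(sparsity.values())

    synthetic = []
    for c, s in sparsity.items():
        Xc = X_min[labels == c]
        n_c = int(round(n_new * s / total))
        if len(Xc) < 2 or n_c == 0:
            continue
        nn = NearestNeighbors(n_neighbors=min(5, len(Xc) - 1) + 1).fit(Xc)
        _, idx = nn.kneighbors(Xc)
        for _ in range(n_c):
            i = rng.integers(len(Xc))
            j = rng.choice(idx[i][1:])        # neighbour inside the same cluster
            gap = rng.random()
            synthetic.append(Xc[i] + gap * (Xc[j] - Xc[i]))  # interpolate on the path
    return np.array(synthetic)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_min = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 2, (10, 2))])
    print(oversample(X_min, n_new=30).shape)
```

Restricting interpolation to neighbours within the same cluster is one simple way to keep synthetic samples inside minority-class regions and reduce overlap with the majority class; the paper's full method additionally weights samples by their informative level and uses Mutual Information when clustering the majority class.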
