Gaussian Distribution Based Oversampling for Imbalanced Data Classification

Yuxi Xie,Zhenxiang Chen,Lizhi Peng,Haibo Zhang,Min Qiu

doi:10.1109/tkde.2020.2985965

Abstract

The imbalanced data classification problem widely exists in many real-world applications. Data resampling is a promising technique to deal with imbalanced data through either oversampling or undersampling. However, the traditional data resampling approaches simply take into account the local neighbor information to generate new instances in linear ways, leading to the generation of incorrect and unnecessary instances. In this study, we propose a new data resampling technique, namely, Gaussian Distribution based Oversampling (GDO), to handle the imbalanced data for classification. In GDO, anchor instances are selected from the minority class instances in a probabilistic way by taking into account the density and distance information carried by the minority instances. Then new minority instances are generated following a Gaussian distribution model. The proposed method is validated in experimental study by comparing with seven imbalanced learning approaches on 40 data sets from the KEEL repository and 10 large data sets from the UCI repository. Experimental results show that our method outperforms the other compared methods in terms of AUC, G-mean and memory usage with an increase in running time. We also apply GDO to deal with two real imbalanced data classification problems: Internet video traffic identification and metastasis detection of esophageal cancer. The experimental results once again validate the effectiveness of our approach.

Full Text