Abstract
Imbalanced data classification arises in many real-world applications. Data resampling is a promising technique for dealing with imbalanced data through either oversampling or undersampling. However, traditional data resampling approaches take into account only local neighborhood information and generate new instances in a linear fashion, which leads to the generation of incorrect and unnecessary instances. In this study, we propose a new data resampling technique, namely Gaussian Distribution based Oversampling (GDO), to handle imbalanced data for classification. In GDO, anchor instances are selected from the minority class instances in a probabilistic way by taking into account the density and distance information carried by the minority instances. New minority instances are then generated following a Gaussian distribution model. The proposed method is validated in an experimental study by comparing it with seven imbalanced learning approaches on 40 data sets from the KEEL repository and 10 large data sets from the UCI repository. Experimental results show that our method outperforms the other compared methods in terms of AUC, G-mean, and memory usage, at the cost of increased running time. We also apply GDO to two real-world imbalanced data classification problems: Internet video traffic identification and metastasis detection of esophageal cancer. The experimental results once again validate the effectiveness of our approach.
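To make the two stages described above concrete (probabilistic anchor selection, then Gaussian generation), the following is a minimal Python sketch and not the authors' implementation. The abstract does not give the weighting formula or the Gaussian parameters, so everything specific here is an assumption made for illustration: the function name `gaussian_oversample`, the parameters `k` and `sigma_scale`, the density factor (share of majority instances among the k nearest neighbors), the distance factor (inverse of the mean neighbor distance), and the isotropic Gaussian centered on each anchor.

```python
import numpy as np

def gaussian_oversample(X_min, X_maj, n_new, k=5, sigma_scale=0.5, seed=0):
    """Illustrative sketch only; weights and sigma are assumptions, not GDO's formulas."""
    rng = np.random.default_rng(seed)
    n_min = len(X_min)
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.r_[np.zeros(n_min, dtype=bool), np.ones(len(X_maj), dtype=bool)]

    # Distances from every minority instance to every instance (self excluded).
    d = np.linalg.norm(X_min[:, None, :] - X_all[None, :, :], axis=2)
    d[np.arange(n_min), np.arange(n_min)] = np.inf
    nn = np.argsort(d, axis=1)[:, :k]
    nn_dist = np.take_along_axis(d, nn, axis=1)

    # Assumed density factor: share of majority instances among the k neighbors.
    density = is_maj[nn].mean(axis=1)
    # Assumed distance factor: closer neighborhoods get a larger value.
    closeness = 1.0 / (1.0 + nn_dist.mean(axis=1))

    # Turn the combined score into selection probabilities over minority instances.
    score = density + closeness
    probs = score / score.sum()

    # Stage 1: draw anchors probabilistically.
    anchors = rng.choice(n_min, size=n_new, p=probs)
    # Stage 2: sample around each anchor from an isotropic Gaussian whose
    # standard deviation scales with the anchor's mean neighbor distance.
    sigma = sigma_scale * nn_dist.mean(axis=1)[anchors]
    noise = rng.normal(size=(n_new, X_min.shape[1])) * sigma[:, None]
    return X_min[anchors] + noise, probs

# Usage on synthetic data: 10 minority and 50 majority points in 2-D.
data_rng = np.random.default_rng(1)
X_min = data_rng.normal(0.0, 1.0, size=(10, 2))
X_maj = data_rng.normal(3.0, 1.0, size=(50, 2))
X_new, probs = gaussian_oversample(X_min, X_maj, n_new=40)
```

Unlike linear interpolation between a minority instance and one of its neighbors, sampling from a Gaussian around the anchor is not restricted to the line segment joining two existing points, which is the contrast with traditional resampling that the abstract draws.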
More From: IEEE Transactions on Knowledge and Data Engineering