A Three-stage Method for Classification of Binary Imbalanced Big Data

Jun-Hai Zhai, Su-Fang Zhang, Yan Li, Mo-Han Wang

doi:10.1109/icmlc51923.2020.9469568

Abstract

Many real-world classification problems involve imbalanced data, including extreme weather prediction, software defect prediction, machinery fault diagnosis, and spam filtering, so the study of imbalanced data classification has both theoretical and practical value. Within the framework of binary imbalanced data classification, this paper proposes a three-stage method for classifying binary imbalanced big data. In the first stage, the negative-class big data is partitioned into K clusters by the K-means algorithm on the Hadoop platform. In the second stage, an instance selection method is applied to each cluster in parallel to select important samples, yielding K negative-class subsets. In the third stage, K balanced training sets are constructed, each consisting of one negative-class subset and the positive-class subset; K classifiers are then trained on these sets and integrated to classify unseen samples. Experiments compare the proposed method with two state-of-the-art methods using the G-mean metric. The experimental results demonstrate that the proposed method is more effective and efficient than the compared approaches.
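The three stages described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation. It runs in plain Python rather than on Hadoop, and because the abstract names neither the instance selection method nor the base classifier, two stand-ins are assumed here: selection keeps the samples nearest each cluster centroid, and each base classifier is a nearest-centroid rule combined by majority vote. All function names are hypothetical. G-mean is computed with its usual definition, the square root of sensitivity times specificity.

```python
import math
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, iters=20, seed=0):
    """Stage 1: partition the negative class into at most k clusters."""
    centroids = random.Random(seed).sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def select_instances(cluster, m):
    """Stage 2 (assumed stand-in): keep the m samples nearest the centroid."""
    center = mean(cluster)
    return sorted(cluster, key=lambda p: dist2(p, center))[:m]

def train_ensemble(negatives, positives, k, seed=0):
    """Stage 3: one nearest-centroid model per balanced training set."""
    pos_center = mean(positives)
    models = []
    for cluster in kmeans(negatives, k, seed=seed):
        subset = select_instances(cluster, len(positives))  # balance the classes
        models.append((mean(subset), pos_center))
    return models

def predict(models, x):
    """Majority vote over the base classifiers; 1 = positive, 0 = negative."""
    votes = sum(1 for neg_c, pos_c in models if dist2(x, pos_c) < dist2(x, neg_c))
    return 1 if 2 * votes > len(models) else 0

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (both classes required)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return math.sqrt((tp / n_pos) * (tn / n_neg))

if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data: 200 negatives near (0, 0), 10 positives near (8, 8).
    negatives = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    positives = [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(10)]
    models = train_ensemble(negatives, positives, k=3)
    samples = negatives + positives
    labels = [0] * len(negatives) + [1] * len(positives)
    print("G-mean:", g_mean(labels, [predict(models, x) for x in samples]))
```

Sizing each negative subset to the number of positive samples is one simple way to make every training set balanced; the paper may balance the sets differently.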

Full Text