Abstract

Multi-class imbalanced classification has emerged as a challenging research area in machine learning for data mining applications. It occurs when the number of training instances in the majority classes is much higher than in the minority classes. Existing machine learning algorithms achieve good accuracy on majority class instances but tend to ignore or misclassify minority class instances. However, the minority class instances often carry the most vital information, and misclassifying them can lead to serious problems. Several sampling techniques combined with ensemble learning have been proposed for binary-class imbalanced classification over the last decade. In this paper, we propose a new ensemble learning technique that employs cluster-based under-sampling with the random forest algorithm to handle multi-class, highly imbalanced data classification. The proposed approach clusters the majority class instances and then selects the most informative instances in each cluster to form several balanced datasets. The random forest algorithm is then applied to each balanced dataset, and a majority voting technique classifies test/new instances. We compared the performance of the proposed method with popular sampling-with-boosting methods (AdaBoost, RUSBoost, and SMOTEBoost) on 13 benchmark imbalanced datasets. The experimental results show that the proposed cluster-based under-sampling with random forest technique achieves high accuracy for classifying both majority and minority class instances compared with the existing methods.
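The pipeline outlined in the abstract (cluster each majority class, draw instances per cluster to build balanced subsets, train a random forest per subset, and combine by majority vote) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the clustering method (KMeans), the number of clusters, the per-cluster selection rule (random draws rather than an informativeness criterion), and the number of balanced subsets are all assumptions for the sketch.

```python
# Hedged sketch of cluster-based under-sampling with a random forest ensemble.
# Assumptions (not from the paper): KMeans clustering, random per-cluster
# selection, 3 balanced subsets, 5 clusters per majority class.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier


def cluster_undersample_rf(X, y, n_subsets=3, n_clusters=5, random_state=0):
    """Train one random forest per cluster-balanced subset of (X, y)."""
    rng = np.random.RandomState(random_state)
    counts = Counter(y)
    minority_label = min(counts, key=counts.get)
    n_min = counts[minority_label]            # target size per class
    X_min = X[y == minority_label]

    models = []
    for s in range(n_subsets):
        Xb, yb = [X_min], [np.full(n_min, minority_label)]
        for label in counts:
            if label == minority_label:
                continue
            X_maj = X[y == label]
            # Cluster this majority class, then draw a share of instances
            # from every cluster so the subset keeps n_min per class.
            k = min(n_clusters, n_min, len(X_maj))
            km = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state + s).fit(X_maj)
            picked = []
            for c in range(k):
                idx = np.where(km.labels_ == c)[0]
                take = max(1, n_min // k)
                picked.extend(rng.choice(idx, size=min(take, len(idx)),
                                         replace=False))
            picked = np.array(picked[:n_min])
            Xb.append(X_maj[picked])
            yb.append(np.full(len(picked), label))
        clf = RandomForestClassifier(n_estimators=50,
                                     random_state=random_state + s)
        clf.fit(np.vstack(Xb), np.concatenate(yb))
        models.append(clf)
    return models


def predict_vote(models, X):
    """Combine the per-subset forests by plain majority voting."""
    preds = np.stack([m.predict(X) for m in models])
    return np.array([Counter(col).most_common(1)[0][0] for col in preds.T])
```

Selecting from every cluster (rather than sampling the majority class uniformly) is what preserves the internal structure of the majority class in each balanced subset; training several such subsets and voting reduces the variance that any single under-sampled subset would introduce.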
