An Improved Ensemble Classification Algorithm for Imbalanced Data with Sample Overlap

Yafei Zhang, Fei Han

doi:10.1007/978-981-19-6135-9_34

Abstract

Classification of imbalanced data remains an important topic in machine learning. Recent literature shows that class overlap has a considerable negative impact on the performance of learning algorithms, and the learning task becomes more difficult when overlap occurs in imbalanced data sets. This paper addresses the problem of sample overlap in the imbalanced setting and proposes an ensemble classification method for imbalanced data, namely unbalanced overlapping random forest (IORF). A coefficient of sample difficulty is introduced to measure the importance of each training sample. When different data subsets are generated with the weighted bootstrap method, more attention is paid to overlapping samples. The generated data subsets are then used to train diverse decision trees for the ensemble. In addition, because repeatedly selecting minority class samples can cause the classifier to over-fit them, a data enhancement method based on Gaussian perturbation is proposed to reduce over-fitting to the overlapping minority class samples. Experimental results show that the proposed method further improves classification performance.

Keywords: Ensemble learning, Data imbalance, Sample overlap, Data enhancement, Random forest
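
The sampling stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the coefficient of sample difficulty, so the k-nearest-neighbour score below is a hypothetical stand-in, and the function names, the weighting `1 + difficulty`, the choice to perturb only repeated minority draws, and the parameters `k` and `sigma` are all assumptions made for illustration.

```python
import math
import random


def knn_difficulty(X, y, k=3):
    """Hypothetical difficulty score (an assumption, not the paper's formula):
    the share of the k nearest neighbours that carry a different label."""
    scores = []
    for i, xi in enumerate(X):
        # Distances from sample i to every other sample, nearest first.
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours = [j for _, j in dists[:k]]
        scores.append(sum(y[j] != y[i] for j in neighbours) / k)
    return scores


def weighted_bootstrap(X, y, weights, minority_label, sigma=0.05, rng=None):
    """Draw len(X) samples with replacement, favouring high-weight samples.
    A minority sample drawn more than once is jittered with Gaussian noise
    so the subset does not contain exact copies of it (assumed rule)."""
    rng = rng or random.Random()
    n = len(X)
    drawn = rng.choices(range(n), weights=weights, k=n)
    seen = set()
    X_sub, y_sub = [], []
    for i in drawn:
        row = list(X[i])
        if y[i] == minority_label and i in seen:
            # Gaussian perturbation of a repeated minority sample.
            row = [v + rng.gauss(0.0, sigma) for v in row]
        seen.add(i)
        X_sub.append(row)
        y_sub.append(y[i])
    return X_sub, y_sub


# Toy data: four majority points (label 0), one isolated minority point,
# and one minority point (label 1) lying inside the majority cluster.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [1.0, 1.0], [0.15, 0.05]]
y = [0, 0, 0, 0, 1, 1]

scores = knn_difficulty(X, y, k=3)
# Add 1 so that easy samples keep a non-zero chance of being drawn.
weights = [1.0 + s for s in scores]
X_sub, y_sub = weighted_bootstrap(X, y, weights, minority_label=1,
                                  rng=random.Random(0))
```

In a full ensemble, this sampling step would be repeated once per tree and each resulting subset would train one decision tree. That training step is omitted here to keep the sketch dependency-free.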

Full Text