Abstract

Many machine learning problem domains, such as the detection of fraud, spam, outliers, and anomalies, involve inherently imbalanced class distributions. However, most classification algorithms assume comparable sample sizes for each class, so imbalanced datasets pose a significant challenge for predictive modeling. Herein, we propose a density-based random forest algorithm (DBRF) to improve prediction performance, especially for minority classes. DBRF recognizes boundary samples as the most difficult to classify and uses a density-based method to augment them. Two random forest classifiers are then constructed to model the augmented boundary samples and the original dataset independently, and the final output is determined using a bagging technique. A real-world material classification dataset and 33 open public imbalanced datasets were used to evaluate the performance of DBRF. Across the 34 datasets, DBRF achieved improvements of 2–15% over random forest in terms of the F1-measure and G-mean. The experimental results demonstrate the ability of DBRF to classify objects located on the class boundary, including objects of minority classes, by taking the density of objects in space into account.
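The pipeline described above can be sketched in a few lines. This is an illustrative approximation only, not the authors' implementation: the boundary-detection rule (flagging samples whose k nearest neighbours are class-mixed), the dataset, and all parameter values are assumptions standing in for the paper's density-based method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Imbalanced toy dataset (assumption: a stand-in for the paper's benchmarks).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Step 1: flag "boundary" samples with a simple neighbourhood/density proxy:
# a sample lies near the class boundary if its k nearest neighbours are mixed.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_tr)
_, idx = nn.kneighbors(X_tr)
neigh_labels = y_tr[idx[:, 1:]]                    # drop each sample itself
boundary = (neigh_labels != y_tr[:, None]).any(axis=1)

# Step 2: train one forest on the boundary subset, one on the full data.
rf_boundary = RandomForestClassifier(n_estimators=100, random_state=0)
rf_boundary.fit(X_tr[boundary], y_tr[boundary])
rf_full = RandomForestClassifier(n_estimators=100, random_state=0)
rf_full.fit(X_tr, y_tr)

# Step 3: combine the two forests by averaging their class probabilities,
# a simple form of the bagging-style aggregation described in the abstract.
proba = (rf_boundary.predict_proba(X_te) + rf_full.predict_proba(X_te)) / 2
y_pred = proba.argmax(axis=1)
print(f"minority-class F1: {f1_score(y_te, y_pred):.3f}")
```

The key design point is that the boundary-only forest sees a far more balanced subset, so its votes counteract the majority-class bias of the forest trained on the full data.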

Highlights

  • Real-world datasets generally exhibit notable imbalances between different data classes, and the effectiveness of computational classification methods is typically limited by this uneven distribution

  • As the number of selected majority-class samples increased, the density domain played a progressively smaller role, deviating from our aim in proposing the density-based random forest algorithm (DBRF): improving the prediction of minority classes through additional training on boundary samples

  • We proposed a density-based random forest algorithm (DBRF) to improve the prediction performance, especially for minority classes

Introduction

Real-world datasets generally exhibit notable imbalances between different data classes, and the effectiveness of computational classification methods is typically limited by this uneven distribution. Most classification algorithms are designed for balanced data, and their performance degrades when processing imbalanced data. Samples in imbalanced datasets are primarily divided into majority and minority types according to the number of samples in each class. Given that most classification algorithms use overall accuracy as their key evaluation metric, they tend to classify samples as belonging to the majority class and heavily neglect the minority class. In some real application scenarios, the negative effect of misclassifying majority-class samples is less serious than that of misclassifying minority-class samples. The development of new or improved methods to reduce the misclassification of minority classes is therefore required.
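The accuracy pitfall described above is easy to demonstrate. In this hypothetical example (the 950/50 split is an assumption for illustration), a classifier that never predicts the minority class still scores high accuracy while being useless for the cases that matter:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 950 majority (0), 50 minority (1).
y_true = np.array([0] * 950 + [1] * 50)

# A degenerate classifier that always predicts the majority class.
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))              # 0.95 — looks strong
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0 — minority never found
```

This is why imbalance-aware metrics such as the F1-measure and G-mean, rather than overall accuracy, are used to evaluate methods like DBRF.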
