Abstract

Class imbalance is a challenging problem in machine learning because most classifiers are biased toward the dominant class. The most popular approaches to solving this problem include oversampling minority examples and undersampling majority examples. Oversampling may increase the probability of overfitting, whereas undersampling eliminates examples that may be crucial to the learning process. To address both concerns, we present a linear-time resampling method based on random data partitioning and a majority voting rule, in which an imbalanced dataset is partitioned into a number of small subdatasets, each of which must be class balanced. A separate classifier is then trained on each subdataset, and the final classification is established by applying the majority voting rule to the predictions of all of the trained models. We compared the performance of the proposed method to some of the most well-known oversampling and undersampling methods, employing a range of classifiers, on 33 benchmark class-imbalanced machine learning datasets. The classifiers trained on data generated by the proposed method produced results comparable to those of most of the resampling methods tested, with the exception of SMOTEFUNA, an oversampling method that increases the probability of overfitting. The proposed method produced results comparable to the Easy Ensemble (EE) undersampling method. We therefore advocate using either EE or our method for machine learning from class-imbalanced datasets.

Highlights

  • A class imbalance problem occurs when a training dataset contains examples of one class that significantly outnumber those of the other class(es). The former is normally referred to as the majority class, while the latter is referred to as the minority class

  • The core problem with class imbalance is that classifiers trained on unequal training sets have a prediction bias that is associated with poor performance in the minority class(es)

  • The Easy Ensemble and Balance Cascade (EE&BC) [115] is perhaps one of the most interesting undersampling approaches we found in the literature


Summary

Introduction

A class imbalance problem occurs when a training dataset contains examples of one class that significantly outnumber those of the other class(es). Oversampling increases the number of examples belonging to the minority class, while undersampling reduces the number of examples from the majority class. Both approaches, in our opinion, have their own set of problems. Oversampling increases the probability of overfitting, which can produce positive results on paper while the opposite holds in practice. Undersampling, on the other hand, eliminates examples that may be critical to the learning process. We therefore propose a method for machine learning from class-imbalanced datasets that avoids the drawbacks of both oversampling and undersampling.
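The partition-and-vote idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes binary labels with class 1 as the minority, and adopts one plausible reading of "class-balanced subdatasets" in which each random chunk of majority examples is paired with the full minority set (as in EE-style ensembles). The function name and the choice of decision trees as base classifiers are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def partition_vote_fit_predict(X, y, X_test, rng_seed=0):
    """Sketch: random balanced partitioning + majority voting.

    Assumes binary labels where 1 is the minority class. Hypothetical
    helper, not the paper's reference implementation.
    """
    rng = np.random.default_rng(rng_seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = rng.permutation(np.flatnonzero(y == 0))

    # Number of subdatasets so that each majority chunk is roughly
    # the size of the minority class (class-balanced subdatasets).
    k = max(1, len(maj_idx) // len(min_idx))

    models = []
    for chunk in np.array_split(maj_idx, k):
        # Each subdataset = one majority chunk + all minority examples.
        idx = np.concatenate([chunk, min_idx])
        clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        models.append(clf)

    # Majority voting rule over the k trained models.
    votes = np.mean([m.predict(X_test) for m in models], axis=0)
    return (votes >= 0.5).astype(int)
```

Because the majority class is partitioned rather than discarded, every majority example participates in training exactly one model, which is the property that distinguishes this scheme from plain random undersampling.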

Related Work
Oversampling
Undersampling
The Proposed Method
Results
Experiments and Results
Conclusions
