Abstract
Class imbalance presents a major hurdle in the application of classification methods. A commonly taken approach is to learn ensembles of classifiers on rebalanced data, for example bootstrap aggregating (bagging) combined with either undersampling of the majority class or oversampling of the minority class. However, rebalancing methods entail asymmetric changes to the examples of the different classes, which can in turn introduce biases of their own. Furthermore, these methods often require the performance measure of interest to be specified a priori, i.e., before learning. An alternative is the threshold moving technique, which applies a threshold to the continuous output of a model and can therefore adapt to a performance measure a posteriori, i.e., it acts as a plug-in method. Surprisingly, little attention has been paid to the combination of a bagging ensemble with threshold moving. In this paper, we study this combination and demonstrate its competitiveness. In contrast to resampling methods, we preserve the natural class distribution of the data, which results in well-calibrated posterior probabilities. Additionally, we extend the proposed method to handle multiclass data. We validate our method on binary and multiclass benchmark data sets, using both decision trees and neural networks as base classifiers, and we perform analyses that provide insights into the proposed method.
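To illustrate the general idea, the following Python sketch trains a plain bagging ensemble on the natural (imbalanced) class distribution, averages the members' predicted probabilities, and only afterwards applies a decision threshold. This is a minimal sketch, not the authors' exact implementation: the helper names (fit_pt_bagging, predict_with_threshold), the use of scikit-learn's BaggingClassifier, and the binary {0, 1} labels are assumptions made for illustration.

    # Minimal sketch of bagging plus a posteriori threshold moving.
    # Assumptions: binary labels {0, 1}, scikit-learn's BaggingClassifier
    # (decision-tree base learners by default), and a user-chosen threshold.
    import numpy as np
    from sklearn.ensemble import BaggingClassifier

    def fit_pt_bagging(X_train, y_train, n_estimators=100, random_state=0):
        """Train a standard bagging ensemble on the natural class distribution."""
        ensemble = BaggingClassifier(n_estimators=n_estimators,
                                     random_state=random_state)
        ensemble.fit(X_train, y_train)
        return ensemble

    def predict_with_threshold(ensemble, X, threshold=0.5):
        """Average member probabilities, then apply the threshold a posteriori."""
        # predict_proba averages the probabilities of the ensemble members.
        p_minority = ensemble.predict_proba(X)[:, 1]
        return (p_minority >= threshold).astype(int)

Because the threshold is applied only at prediction time, the same fitted ensemble can serve different performance measures simply by changing the threshold, without retraining.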
Highlights
Dealing with class imbalance in classification is an important problem that poses major challenges [1]
It is worth mentioning that, unsurprisingly, the area under the receiver operating characteristic (ROC) curve showed a much more cluttered picture, which we omit in the interest of space
(2) PTMA, Roughly Balanced (RB)- and Exactly Balanced (EB)-bagging perform better on macro-accuracy, while PTF1, SMOTE- and Random Balance (RNB)-bagging perform better on the macro F1-score. This shows that different resampling mechanisms suit different performance measures; (3) PT-bagging, with appropriate thresholds, performed well on each of the evaluated measures, while each of the remaining methods performed poorly on at least one of them, e.g., RB-bagging performed poorly on macro F1-score and AUCPR, while SMOTE- and RNB-bagging performed poorly on macro-accuracy
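To make points (2) and (3) concrete, the plug-in threshold can be tuned a posteriori for whichever measure is of interest. The sketch below is an illustration under our own assumptions rather than the paper's code: the helper name tune_threshold, the grid of 99 candidate thresholds, and the use of scikit-learn's f1_score and balanced_accuracy_score are all assumptions.

    # Hypothetical illustration: pick a plug-in threshold for a chosen measure
    # on held-out validation data, then reuse the same ensemble for prediction.
    import numpy as np
    from sklearn.metrics import f1_score, balanced_accuracy_score

    def macro_f1(y_true, y_pred):
        return f1_score(y_true, y_pred, average="macro")

    def tune_threshold(y_val, p_val_minority, metric=macro_f1):
        """Return the threshold in (0, 1) that maximises `metric` on validation data."""
        grid = np.linspace(0.01, 0.99, 99)  # assumed candidate grid
        scores = [metric(y_val, (p_val_minority >= t).astype(int)) for t in grid]
        return grid[int(np.argmax(scores))]

    # One threshold per target measure, mirroring points (2) and (3) above, e.g.
    #   t_f1  = tune_threshold(y_val, p_val, metric=macro_f1)
    #   t_acc = tune_threshold(y_val, p_val, metric=balanced_accuracy_score)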
Summary
Dealing with class imbalance in classification is an important problem that poses major challenges [1]. Standard learning algorithms are often guided by global error rates and may ignore instances of the minority class, leading to models biased towards predicting the majority class. A common first choice is to preprocess the data by resampling to balance the class distribution [8,9]. This is often achieved by either randomly oversampling (ROS) the minority class [9] or randomly undersampling (RUS) the majority class [10].
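For concreteness, the two baseline resampling schemes can be sketched as follows. This is a simple NumPy illustration assuming binary labels; the helper names random_oversample and random_undersample are our own, and libraries such as imbalanced-learn provide more general implementations.

    # Sketch of random oversampling (ROS) and random undersampling (RUS) for a
    # binary problem: indices are resampled so both classes end up equally sized.
    import numpy as np

    def random_oversample(X, y, rng=None):
        """Duplicate minority examples (with replacement) up to the majority count."""
        rng = rng if rng is not None else np.random.default_rng(0)
        classes, counts = np.unique(y, return_counts=True)
        minority = classes[np.argmin(counts)]
        min_idx = np.flatnonzero(y == minority)
        extra = rng.choice(min_idx, size=counts.max() - counts.min(), replace=True)
        idx = np.concatenate([np.arange(len(y)), extra])
        return X[idx], y[idx]

    def random_undersample(X, y, rng=None):
        """Keep all minority examples and a random subset of the majority class."""
        rng = rng if rng is not None else np.random.default_rng(0)
        classes, counts = np.unique(y, return_counts=True)
        minority = classes[np.argmin(counts)]
        majority = classes[np.argmax(counts)]
        maj_idx = np.flatnonzero(y == majority)
        keep_maj = rng.choice(maj_idx, size=counts.min(), replace=False)
        idx = np.concatenate([np.flatnonzero(y == minority), keep_maj])
        return X[idx], y[idx]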