Abstract

The problem of class imbalance in machine learning arises when the distribution of classes in a classification dataset is highly disproportionate. In many real-world domains, such as healthcare, finance, and predictive maintenance, the number of data points in the less important class (usually the negative class) is much larger than that in the class of greater interest (usually the positive or target class). This imbalance degrades the ability of many learning algorithms to find good classification models. Many approaches have been proposed to address this problem, prominently ensemble methods integrated with sampling-based techniques. However, these methods remain prone to the negative effects of sampling-based techniques that alter class distributions via over-sampling or under-sampling, which can lead to overfitting or to discarding useful data, respectively, and thus hurt performance. In this paper, we propose a new data preprocessing sampling technique, dubbed sBal, for ensemble methods in binary classification on imbalanced datasets. Our proposed method first splits the imbalanced dataset into several balanced bins/bags. Multiple base learners are then induced on the balanced bags, and finally the classification results are combined using a specific ensemble rule. We evaluated the proposed method on 50 imbalanced real-world binary datasets and compared its performance with well-known ensemble methods that use data preprocessing techniques, namely SMOTEBagging, SMOTEBoost, RUSBoost, and RAMOBoost. The results show that the proposed method brings considerable improvement in classification performance relative to the compared methods. We performed statistical significance analysis using Friedman's non-parametric statistical test with the Bergmann post-hoc test. The analysis showed that our method performed significantly better than the majority of the compared methods across many datasets, suggesting a more effective preprocessing approach than those used in the compared methods. We also highlight possible extensions to the method that can improve its effectiveness.

Keywords: Data sampling, Ensemble methods, Imbalanced datasets, Split balancing, RAMOBoost, SMOTEBagging, SMOTEBoost
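
To make the split-balancing idea concrete, the sketch below shows one plausible realization in Python with scikit-learn: the majority class is partitioned into chunks of roughly minority-class size, each chunk is paired with the full minority class to form a balanced bag, a base learner is trained per bag, and predictions are combined by majority vote. The bag construction, the choice of base learner, and the voting rule are illustrative assumptions; the paper's actual sBal procedure and ensemble rule may differ.

```python
# Illustrative sketch of split balancing into balanced bags (assumptions noted above).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_split_balanced_ensemble(X, y, base_learner=DecisionTreeClassifier, seed=0):
    """Train one base learner per balanced bag; assumes binary labels {0, 1}."""
    pos_idx = np.where(y == 1)[0]            # minority / target class
    neg_idx = np.where(y == 0)[0]            # majority class
    rng = np.random.default_rng(seed)
    rng.shuffle(neg_idx)
    n_bags = max(1, len(neg_idx) // len(pos_idx))
    models = []
    for neg_chunk in np.array_split(neg_idx, n_bags):
        bag = np.concatenate([pos_idx, neg_chunk])   # one balanced bin/bag
        models.append(base_learner().fit(X[bag], y[bag]))
    return models

def predict_majority_vote(models, X):
    """Combine base-learner outputs; majority voting is an assumed ensemble rule."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Because every bag contains the entire minority class and a disjoint slice of the majority class, no minority examples are duplicated (as in over-sampling) and no majority examples are discarded overall (as in under-sampling), which is the trade-off the abstract highlights.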
