Abstract

Since canonical machine learning algorithms assume that the dataset contains an equal number of samples in each class, binary classification on imbalanced datasets becomes very challenging: it is difficult to discriminate the minority class samples efficiently. For this reason, researchers have paid attention to this problem and proposed many methods to deal with it, which can be broadly categorized into data-level and algorithm-level approaches. Moreover, multi-class imbalanced learning is much harder than the binary case and remains an open problem. Boosting algorithms are a class of ensemble learning methods in machine learning that improve the performance of separate base learners by combining them into a composite whole. This paper aims to review the most significant published boosting techniques for multi-class imbalanced datasets. A thorough empirical comparison is conducted to analyze the performance of binary and multi-class boosting algorithms on various multi-class imbalanced datasets. In addition, based on the obtained results for the performance evaluation metrics and a recently proposed criterion for comparing metrics, the selected metrics are compared to determine a suitable performance metric for multi-class imbalanced datasets. The experimental studies show that the CatBoost and LogitBoost algorithms are superior to other boosting algorithms on multi-class imbalanced conventional and big datasets, respectively. Furthermore, the MMCC is a better evaluation metric than the MAUC and G-mean in multi-class imbalanced data domains.
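As a concrete illustration of the setting studied here, the following is a minimal sketch (not the paper's exact pipeline) of training CatBoost, the best-performing algorithm on conventional datasets in this review, on a synthetic multi-class imbalanced dataset; the class distribution, dataset size, and hyperparameters are assumptions chosen for illustration.

```python
# Minimal sketch: CatBoost on a synthetic multi-class imbalanced dataset.
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Three classes with a skewed 80/15/5 distribution.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.80, 0.15, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = CatBoostClassifier(iterations=200, verbose=False, random_seed=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)          # hard labels, for MMCC and G-mean
y_proba = model.predict_proba(X_test)   # class probabilities, for MAUC
```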

Highlights

  • Imbalanced dataset classification is a relatively new research line within the broader context of machine learning studies, which tries to learn from skewed data distributions

  • For the sake of clarity, it should be noted that the libraries for all algorithms were installed with the pip Python installer (e.g., sudo pip install xgboost), except MEBoost, SMOTEBoost, and AdaCost, whose Python implementations are freely available in GitHub repositories

  • The results show that both the multi-class area under the curve (MAUC) and the multi-class Matthews correlation coefficient (MMCC) are more discriminating than the G-mean (see the metric sketch after this list)
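The three metrics named above can be computed with scikit-learn; this is a minimal sketch under that assumption. The helper names are ours, not the paper's code, and scikit-learn's macro-averaged one-vs-one ROC AUC corresponds to the Hand-and-Till formulation of the MAUC.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, roc_auc_score, recall_score

def g_mean(y_true, y_pred):
    # Geometric mean of the per-class recalls.
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def mauc(y_true, y_proba):
    # Macro-averaged one-vs-one ROC AUC (Hand & Till style);
    # y_proba is an (n_samples, n_classes) probability matrix.
    return roc_auc_score(y_true, y_proba, multi_class="ovo", average="macro")

def mmcc(y_true, y_pred):
    # scikit-learn's matthews_corrcoef generalizes MCC to multi-class.
    return matthews_corrcoef(y_true, y_pred)
```

With the CatBoost sketch above, these would be called as g_mean(y_test, y_pred), mauc(y_test, y_proba), and mmcc(y_test, y_pred).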


Introduction

Imbalanced dataset classification is a relatively new research line within the broader context of machine learning studies, which tries to learn from skewed data distributions. Most standard machine learning algorithms perform poorly on this kind of dataset because they tend to favor the majority class samples, resulting in poor predictive accuracy over the minority class [2], which often contains the most important instances. They assume an equal misclassification cost for all samples and minimize the overall error rate. Learning from skewed datasets becomes very important because many real-world classification problems are imbalanced, e.g., fault prediction [3], fraud detection [4], medical diagnosis [5], text classification [6], oil-spill detection in satellite images [7], and cultural modeling [8]. In software fault prediction, if a defective module is regarded as the positive class and a non-defective module as negative, missing a defect (a false negative) is much more expensive than a false-positive error in the testing phase of the software development process [9]
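One common algorithm-level remedy for such asymmetric costs is to encode them as per-sample weights during training. The sketch below is illustrative only: the 5:1 cost ratio, the helper name, and the use of AdaBoost are assumptions for exposition, not the paper's method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

def cost_weights(y, fn_cost=5.0, fp_cost=1.0):
    # Defective modules (class 1) receive a larger weight, so missing a
    # defect (a false negative) is penalized more heavily during training.
    return np.where(y == 1, fn_cost, fp_cost)

# A skewed binary problem standing in for fault-prediction data.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, sample_weight=cost_weights(y))
```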
