A survey on addressing high-class imbalance in big data

Joffrey L Leevy, Taghi M Khoshgoftaar, Richard A Bauder, Naeem Seliya

doi:10.1186/s40537-018-0151-6

Open Access

https://doi.org/10.1186/s40537-018-0151-6

Abstract

In a majority–minority classification problem, class imbalance in the dataset(s) can dramatically skew the performance of classifiers, introducing a prediction bias for the majority class. Assuming the positive (minority) class is the group of interest and the given application domain dictates that a false negative is much costlier than a false positive, a negative (majority) class prediction bias could have adverse consequences. With big data, mitigating class imbalance poses an even greater challenge because of the varied and complex structure of the much larger datasets. This paper provides a large survey of studies published within the last 8 years, focusing on high-class imbalance (i.e., a majority-to-minority class ratio between 100:1 and 10,000:1) in big data, in order to assess the state of the art in addressing the adverse effects of class imbalance. Two categories of techniques are covered: Data-Level (e.g., data sampling) and Algorithm-Level (e.g., cost-sensitive and hybrid/ensemble) methods. Data sampling methods are popular in addressing class imbalance, with Random Over-Sampling methods generally showing better overall results. At the Algorithm-Level, there are some outstanding performers. Yet the published studies contain inconsistent and conflicting results, coupled with a limited scope in the techniques evaluated, indicating the need for more comprehensive, comparative studies.
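As an illustration of the Data-Level approach the abstract mentions, the following is a minimal sketch of Random Over-Sampling for a binary problem. It is not taken from the surveyed studies: the function name `random_over_sample`, the `target_ratio` parameter, and the toy data are assumptions made for this example, and it assumes minority rows are duplicated by sampling with replacement until the desired majority-to-minority ratio is reached.

```python
import math
import random
from collections import Counter


def random_over_sample(features, labels, minority_label, target_ratio=1.0, seed=0):
    """Duplicate randomly chosen minority rows until
    majority_count / minority_count <= target_ratio.

    Hypothetical helper for illustration; assumes a binary problem in which
    every label other than `minority_label` belongs to the majority class.
    """
    if target_ratio <= 0:
        raise ValueError("target_ratio must be positive")

    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    if not minority_idx:
        raise ValueError("no rows carry the minority label")

    majority_count = len(labels) - len(minority_idx)
    # Number of minority rows required to reach the target ratio.
    needed = math.ceil(majority_count / target_ratio)
    extra = max(0, needed - len(minority_idx))

    # Sample existing minority rows with replacement and append the copies.
    picks = [rng.choice(minority_idx) for _ in range(extra)]
    new_features = list(features) + [features[i] for i in picks]
    new_labels = list(labels) + [labels[i] for i in picks]
    return new_features, new_labels


# Toy data with a 100:1 ratio: 1000 negative rows and 10 positive rows.
X = [[i] for i in range(1010)]
y = [0] * 1000 + [1] * 10

X_bal, y_bal = random_over_sample(X, y, minority_label=1)
print(Counter(y), "->", Counter(y_bal))  # both classes end with 1000 rows
```

Because the added rows are exact copies, this sketch adds no new information to the minority class; it only changes the class proportions the learner sees.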

Highlights

Any dataset with an unequal distribution between its majority and minority classes can be considered to have class imbalance, and in real-world applications, the severity of class imbalance can vary from minor to severe.
Many of the studies investigated in this paper lacked sufficient depth in their empirical investigation of the high-class imbalance problem in big data.
Rio et al. [51] generally has limitations similar to those discussed for Fernandez et al. [32], but in addition, it has the following issues: the number of features used for building trees seems too small compared to the 631 available features; there is no clear indication of why the authors selected the top 90 features; since the ECBDL14 data is used, a comparison with the other studies using that data would add value to the findings presented; the popular Synthetic Minority Over-Sampling Technique (SMOTE) data sampling algorithm is not included; MapReduce, which is known to be sensitive to high-class imbalance, is used instead of the more efficient Apache Spark; and the study seems like a subset of Fernandez et al. [32].

Summary

Introduction

Any dataset with an unequal distribution between its majority and minority classes can be considered to have class imbalance, and in real-world applications, the severity of class imbalance can vary from minor to severe (high or extreme). Rio et al. [51] generally has limitations similar to those discussed for Fernandez et al. [32], but in addition, it has the following issues: the number of features used for building trees seems too small compared to the 631 available features; there is no clear indication of why the authors selected the top 90 features; since the ECBDL14 data is used, a comparison with the other studies using that data would add value to the findings presented; the popular SMOTE data sampling algorithm is not included; MapReduce, which is known to be sensitive to high-class imbalance (as noted by multiple works surveyed in our study), is used instead of the more efficient Apache Spark; and the study seems like a subset of Fernandez et al. [32]. The proposed solution is a combination of algorithm-level approaches (logistic regression with a regularization term) and data-level approaches (question and answer modifications, over-sampling).
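To make the notion of severity concrete, the sketch below computes the majority-to-minority ratio of a binary label list and checks it against the 100:1 to 10,000:1 range that the abstract uses to define high-class imbalance. The function names and the choice to treat both bounds as inclusive are assumptions made for this example, not definitions from the paper.

```python
from collections import Counter

# Bounds of the "high-class imbalance" range given in the abstract,
# treated as inclusive here (an assumption of this sketch).
HIGH_RATIO_MIN = 100
HIGH_RATIO_MAX = 10_000


def imbalance_ratio(labels):
    """Return majority_count / minority_count for a binary label list."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (_, majority), (_, minority) = counts.most_common(2)
    return majority / minority


def is_high_class_imbalance(labels):
    """True when the ratio falls within the 100:1 to 10,000:1 range."""
    return HIGH_RATIO_MIN <= imbalance_ratio(labels) <= HIGH_RATIO_MAX


# Hypothetical label lists: one mildly imbalanced, one highly imbalanced.
mild = [0] * 60 + [1] * 40
severe = [0] * 5000 + [1] * 10

print(imbalance_ratio(mild), is_high_class_imbalance(mild))
print(imbalance_ratio(severe), is_high_class_imbalance(severe))
```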

Findings

Discussion summary of surveyed works

Conclusion

Journal: Journal of Big Data
Publication Date: Nov 1, 2018
Citations: 487
License type: open-access

Similar Papers

Survey on deep learning with class imbalance
Justin M Johnson ... Taghi M Khoshgoftaar
Journal of Big Data | VOL. 6
19 Mar 2019

Examining characteristics of predictive models with imbalanced big data
Tawfiq Hasanin ... Taghi M Khoshgoftaar
Journal of Big Data | VOL. 6
31 Jul 2019

Data Sampling Approaches with Severely Imbalanced Big Data for Medicare Fraud Detection
Richard A Bauder ... Taghi M Khoshgoftaar
01 Nov 2018

Deep Learning and Data Sampling with Imbalanced Big Data
Justin M Johnson ... Taghi M Khoshgoftaar
01 Jul 2019
