Abstract

In a majority–minority classification problem, class imbalance in the data can dramatically skew classifier performance, introducing a prediction bias toward the majority class. When the positive (minority) class is the class of interest and the application domain dictates that a false negative is much costlier than a false positive, a bias toward predicting the negative (majority) class can have adverse consequences. With big data, mitigating class imbalance is even more challenging because of the varied and complex structure of the much larger datasets. This paper provides an extensive survey of studies published within the last 8 years, focusing on high-class imbalance (i.e., a majority-to-minority class ratio between 100:1 and 10,000:1) in big data, in order to assess the state of the art in addressing the adverse effects of class imbalance. Two categories of techniques are covered: Data-Level methods (e.g., data sampling) and Algorithm-Level methods (e.g., cost-sensitive and hybrid/ensemble). Data sampling methods are popular for addressing class imbalance, with Random Over-Sampling methods generally showing better overall results. At the Algorithm-Level, there are some outstanding performers. Yet the published studies report inconsistent and conflicting results and evaluate only a limited range of techniques, indicating the need for more comprehensive, comparative studies.
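To make the abstract's definitions concrete, the following is a minimal sketch, not taken from any surveyed study, of how the majority-to-minority ratio could be computed and how plain Random Over-Sampling balances a binary dataset. The function names, the toy data, and the fixed seed are illustrative assumptions; only the 100:1 to 10,000:1 range comes from the text above.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-to-minority class ratio of a binary label list."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return max(counts.values()) / min(counts.values())


def is_high_class_imbalance(labels, low=100, high=10_000):
    """True when the ratio lies in the 100:1 to 10,000:1 range used in the abstract."""
    return low <= imbalance_ratio(labels) <= high


def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    counts = Counter(labels)
    minority_label = min(counts, key=counts.get)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority_label]
    deficit = max(counts.values()) - counts[minority_label]
    extra = [rng.choice(minority_rows) for _ in range(deficit)]
    return list(rows) + extra, list(labels) + [minority_label] * deficit


# Hypothetical toy data: 1,000 negatives and 5 positives (a 200:1 ratio).
rows = [[float(i)] for i in range(1005)]
labels = [0] * 1000 + [1] * 5
balanced_rows, balanced_labels = random_over_sample(rows, labels)
```

Because over-sampling only duplicates existing minority rows, it adds no new information; it changes how much weight the minority class carries during training.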

Highlights

  • Any dataset with unequal distribution between its majority and minority classes can be considered to have class imbalance, and in real-world applications, the severity of class imbalance can vary from minor to severe

  • Many of the studies surveyed in this paper lacked sufficient depth in their empirical investigation of the high-class imbalance problem in big data

  • Rio et al [51] generally has the same limitations discussed for Fernandez et al [32], and in addition has the following issues: the number of features used for building trees seems too small compared to the 631 available features; there is no clear indication of why the authors selected the top 90 features; since the ECBDL14 data is used, a comparison with other studies using that data would add value to the findings presented; the popular Synthetic Minority Over-sampling Technique (SMOTE) data sampling algorithm is not included; MapReduce, which is known to be sensitive to high-class imbalance, is used instead of the more efficient Apache Spark; and the study seems like a subset of Fernandez et al [32]


Summary

Introduction

Any dataset with an unequal distribution between its majority and minority classes can be considered to have class imbalance, and in real-world applications the severity of class imbalance can vary from minor to severe (high or extreme). Rio et al [51] generally has the same limitations discussed for Fernandez et al [32], and in addition has the following issues: the number of features used for building trees seems too small compared to the 631 available features; there is no clear indication of why the authors selected the top 90 features; since the ECBDL14 data is used, a comparison with other studies using that data would add value to the findings presented; the popular SMOTE data sampling algorithm is not included; MapReduce, which is known to be sensitive to high-class imbalance (as noted by multiple works surveyed in our study), is used instead of the more efficient Apache Spark; and the study seems like a subset of Fernandez et al [32]. In one of the surveyed works, the proposed solution is a combination of algorithm-level approaches (logistic regression with a regularization term) and data-level approaches (question and answer modifications, over-sampling).
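As an illustration of an algorithm-level approach of the kind mentioned above, the sketch below fits a logistic regression with an L2 regularization term and per-class weights, so that errors on the minority class cost more. It is a generic, assumed formulation and not the implementation from any surveyed work; the function names, the class weights, the learning rate, and the toy data are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_weighted_logreg(X, y, class_weight, l2=0.01, lr=0.1, epochs=500):
    """Batch gradient descent on class-weighted log loss plus an L2 penalty on w."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    total_weight = sum(class_weight[label] for label in y)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = class_weight[yi] * (p - yi)  # weighted residual for this row
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        for j in range(n_features):
            w[j] -= lr * (grad_w[j] / total_weight + l2 * w[j])
        b -= lr * grad_b / total_weight  # the bias is not regularized
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that x belongs to the positive (minority) class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical toy data: 200 negatives in [0, 1) and 2 positives near 2 (100:1).
X = [[i / 200.0] for i in range(200)] + [[2.0], [2.2]]
y = [0] * 200 + [1] * 2
weights = {0: 1.0, 1: 100.0}  # assumed: weight each class by the inverse of its frequency
w, b = fit_weighted_logreg(X, y, weights)
```

Weighting the minority class by the imbalance ratio gives both classes equal total influence on the loss, which has an effect similar to over-sampling without enlarging the dataset.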

