Severely imbalanced Big Data challenges: investigating data sampling approaches

Tawfiq Hasanin, Taghi M Khoshgoftaar, Richard A Bauder, Joffrey L Leevy

doi:10.1186/s40537-019-0274-4

Open Access

https://doi.org/10.1186/s40537-019-0274-4

Journal: Journal of Big Data	Publication Date: Nov 30, 2019
Citations: 58	License type: open-access

Affiliation: Florida Atlantic University

Abstract

Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority (negative) class and the occurrence of false negatives incurs a greater penalty than false positives, the bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test data from a separate source (POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach using Area Under the Receiver Operating Characteristic Curve and Geometric Mean performance metrics. However, it should be noted that the Random Undersampling approach performs adequately in the first case study. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, Synthetic Minority Over-sampling TEchnique, SMOTE-borderline1, SMOTE-borderline2, ADAptive SYNthetic) when measuring performance with Area Under the Receiver Operating Characteristic Curve and Geometric Mean metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice as it results in models with a significantly smaller number of samples, thus reducing computational burden and training time.
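As a rough illustration of the Random Undersampling idea described in the abstract, the following plain-Python sketch randomly discards majority-class rows until a chosen majority:minority ratio is reached. The function name, the integer ratio parameters, and the label encoding (1 = minority/positive, 0 = majority/negative) are assumptions made for this example; the study ran on Apache Spark, and its actual implementation is not reproduced here.

```python
import random


def random_undersample(labels, majority_parts, minority_parts, seed=0):
    """Return sorted row indices kept after random undersampling.

    Hypothetical helper: labels are assumed to be 1 (minority) or 0 (majority).
    majority_parts:minority_parts is the target ratio, e.g. 90 and 10.
    All minority rows are kept; majority rows are sampled without replacement.
    """
    if majority_parts <= 0 or minority_parts <= 0:
        raise ValueError("ratio parts must be positive")

    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]

    # Number of majority rows needed to hit the target ratio.
    target = round(len(minority) * majority_parts / minority_parts)
    # Undersampling can only remove rows, never create new ones.
    target = min(target, len(majority))

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    kept_majority = rng.sample(majority, target)
    return sorted(minority + kept_majority)


if __name__ == "__main__":
    labels = [1] * 10 + [0] * 990          # toy data, roughly 1% positive
    kept = random_undersample(labels, 90, 10)
    print(len(kept))                        # rows remaining after sampling
```

With 10 positive and 990 negative toy rows and a 90:10 target, the sketch keeps all 10 positives and 90 randomly chosen negatives. This is the sense in which undersampling yields a much smaller training set than the oversampling alternatives.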

Highlights

The exponential increase of raw data in recent years has been associated with technological advances in the fields of Data Mining (DM) and Machine Learning (ML) [1, 2].
Our work evaluates six data sampling approaches for addressing the effect that severe class imbalance has on Big Data analytics.
For the Area Under the Receiver Operating Characteristic Curve (AUC) metric, the best sampled distribution ratios were obtained by Random Undersampling (RUS) at 90:10, SMOTE at 65:35, and RUS at 90:10 for Gradient-Boosted Trees (GBT), Logistic Regression (LR), and Random Forest (RF), respectively.
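The two performance metrics named above can be sketched in plain Python using their standard definitions: Geometric Mean as the square root of the true positive rate times the true negative rate, and AUC as the probability that a randomly chosen positive instance is scored above a randomly chosen negative one. The function names and the 1/0 label encoding are assumptions for illustration, and the pairwise AUC loop is quadratic, so it is suited only to small examples and not to the Big Data setting of the study.

```python
import math


def geometric_mean(y_true, y_pred):
    """Geometric Mean = sqrt(TPR * TNR), assuming labels 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of rows.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # A model that always predicts the majority class on 1%-positive toy data.
    y_true = [1] + [0] * 99
    always_negative = [0] * 100
    print(geometric_mean(y_true, always_negative))
```

On the toy data, a model that always predicts the negative class is correct for 99 of 100 rows yet has a Geometric Mean of zero, because its true positive rate is zero. That is why metrics of this kind are preferred over plain accuracy when the classes are severely imbalanced.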

Summary

Introduction

The exponential increase of raw data in recent years has been associated with technological advances in the fields of Data Mining (DM) and Machine Learning (ML) [1, 2]. The remainder of this paper is organized as follows: “Related work” section provides an overview of literature related to data sampling methods that address severe class imbalance in Big Data; “Case studies datasets” section presents the details of the Medicare, SlowlorisBig, and POST datasets; “Methodologies” section describes the different aspects of the methodologies used to develop and implement our approach, including the Big Data processing framework, one-hot encoding, sampling ratios, sampling techniques, learners, performance metrics, and framework design.

In [23], Fernández et al. provide an insight into imbalanced Big Data classification outcomes and challenges. They compared RUS, ROS, and SMOTE using MapReduce with two subsets of the Evolutionary Computation for Big Data and Big Learning (ECBDL’14) dataset [24], while maintaining the original class ratio.

As in the first case study, we provided statistics (shown in Table 2) based on the datasets generated after the application of various sampling techniques.
