Abstract

Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority (negative) class and the occurrence of false negatives incurs a greater penalty than false positives, the bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test data from a separate source (POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach using Area Under the Receiver Operating Characteristic Curve and Geometric Mean performance metrics. However, it should be noted that the Random Undersampling approach performs adequately in the first case study. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, Synthetic Minority Over-sampling TEchnique, SMOTE-borderline1 , SMOTE-borderline2 , ADAptive SYNthetic) when measuring performance with Area Under the Receiver Operating Characteristic Curve and Geometric Mean metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice as it results in models with a significantly smaller number of samples, thus reducing computational burden and training time.

Highlights

  • The exponential increase of raw data in recent years has been associated with technological advances in the fields of Data Mining (DM) and Machine Learning (ML) [1, 2]

  • Our work evaluates six data sampling approaches for addressing the effect that severe class imbalance has on Big Data analytics

  • For the Area Under the Receiver Operating Characteristic Curve (AUC) metric, the best sampled distribution ratios were obtained by Random Undersampling (RUS) at 90:10, SMOTE at 65:35, and RUS at 90:10 for Gradient-Boosted Trees (GBT), Logistic Regression (LR), and Random Forest (RF), respectively

Read more

Summary

Introduction

The exponential increase of raw data in recent years has been associated with technological advances in the fields of Data Mining (DM) and Machine Learning (ML) [1, 2]. The remainder of this paper is organized as follows: “Related work” section provides an overview of literature related to data sampling methods that address severe class imbalance in Big Data; “Case studies datasets” section presents the details of the Medicare, SlowlorisBig, and POST datasets; “Methodologies” section describes the different aspects of the methodologies used to develop and implement our approach, including the Big Data processing framework, one-hot encoding, sampling ratios, sampling techniques, learners, performance metrics, and framework design. In [23], Fernández et al provide an insight into imbalanced Big Data classification outcomes and challenges They compared RUS, ROS, and SMOTE using MapReduce with two subsets of the Evolutionary Computation for Big Data and Big Learning (ECBDL’14) dataset [24], while maintaining the original class ratio. As in the first case study, we provided statistics (shown in Table 2) based on the datasets generated after the application of various sampling techniques

Results and discussion
Method
Conclusion
14. The Apache Software Foundation
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.