Abstract

High class imbalance between the majority and minority classes in a dataset can skew the performance of Machine Learning algorithms and bias predictions in favor of the majority (negative) class. When the minority (positive) class is of greater interest and false negatives are costlier than false positives, this bias may have adverse consequences. Our paper presents two case studies, each utilizing a unique, combined approach of Random Undersampling and Feature Selection to investigate the effect of class imbalance on big data analytics. Random Undersampling is used to generate six class distributions ranging from balanced to moderately imbalanced, and Feature Importance is used as our Feature Selection method. Classification performance is reported for the Random Forest, Gradient-Boosted Trees, and Logistic Regression learners, as implemented within the Apache Spark framework. The first case study uses a training dataset and a test dataset from the ECBDL’14 bioinformatics competition, containing about 32 million and 2.9 million instances, respectively. For this case study, Gradient-Boosted Trees obtained the best results, with either a feature set of 60 features or the full feature set, and a negative-to-positive ratio of either 45:55 or 40:60. The second case study, unlike the first, draws its training data from one source (the POST dataset) and its test data from a separate source (the Slowloris dataset), where POST and Slowloris are two types of Denial of Service attacks. The POST dataset contains about 1.7 million instances, while the Slowloris dataset contains about 0.2 million instances. For the second case study, Logistic Regression obtained the best results, with a feature set of 5 features and any of the following negative-to-positive ratios: 40:60, 45:55, 50:50, 65:35, and 75:25. We conclude that combining Feature Selection with Random Undersampling improves the classification performance of learners on imbalanced big data from different application domains.

Highlights

  • A generally established set of data-related properties is used to characterize and define big data, including volume, variety, velocity, variability, value, and complexity [1]

  • All classifier performance results are reported as the product of the True Positive Rate (TPrate) and the True Negative Rate (TNrate); see the sketch following this list

  • It is worth noting that using the original training data to build the models yielded TPrate × TNrate scores of 0, as the models failed to correctly classify any instances from the positive class
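
This combined metric can be computed directly from the four confusion-matrix counts. The following Python sketch is our own illustration (the function and variable names are not from the paper); it also shows why a model that never predicts the positive class scores 0:

```python
def tprate_tnrate_product(tp, fn, tn, fp):
    """Product of True Positive Rate and True Negative Rate.

    TPrate = TP / (TP + FN): recall on the positive (minority) class.
    TNrate = TN / (TN + FP): recall on the negative (majority) class.
    """
    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0
    tn_rate = tn / (tn + fp) if (tn + fp) else 0.0
    return tp_rate * tn_rate

# A model that only predicts the majority class gets TPrate = 0,
# so the product is 0 even though TNrate is perfect:
print(tprate_tnrate_product(tp=0, fn=100, tn=9900, fp=0))  # 0.0
```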

Introduction

A generally established set of data-related properties is used to characterize and define big data, including volume, variety, velocity, variability, value, and complexity [1].

Case study 1: ECBDL’14 dataset

Our method for the first case study is sequentially outlined in six steps: (1) select subsets of features using the Feature Importance (FI) function of the Random Forest (RF) learner; (2) implement one-hot encoding; (3) create six different class distribution ratios with Random Undersampling (RUS); (4) distribute the datasets; (5) train the Gradient-Boosted Trees (GBT), RF, and Logistic Regression (LR) learners; and (6) perform model prediction against a separate test set.
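
A condensed PySpark sketch of steps (1), (3), and (5) is given below. It is a minimal illustration under stated assumptions, not the authors' code: the toy DataFrame, column names, seed, and the 45:55 target ratio are our own choices, and steps (2), (4), and (6) are omitted for brevity.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier

spark = SparkSession.builder.appName("rus-fi-sketch").getOrCreate()

# Toy stand-in for the real training data (illustrative only);
# label 1.0 marks the positive (minority) class.
df = spark.createDataFrame(
    [(float(i % 10), float(i % 7), float(i % 50 == 0)) for i in range(1000)],
    ["f1", "f2", "label"])
feature_cols = [c for c in df.columns if c != "label"]

# Step 1: rank features with the RF Feature Importance function and
# keep the top k (the first case study uses a 60-feature subset).
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
rf_model = RandomForestClassifier(
    labelCol="label", featuresCol="features").fit(assembler.transform(df))
ranked = sorted(zip(feature_cols, rf_model.featureImportances.toArray()),
                key=lambda pair: pair[1], reverse=True)
top_k = [name for name, _ in ranked[:60]]

# Step 3: Random Undersampling of the majority (negative) class down
# to a target negative-to-positive ratio, e.g. 45:55. Note that
# DataFrame.sample is approximate, not an exact count.
pos = df.filter("label = 1.0")
neg = df.filter("label = 0.0")
neg_per_pos = 45 / 55
fraction = min(1.0, pos.count() * neg_per_pos / neg.count())
rus_df = pos.unionByName(
    neg.sample(withReplacement=False, fraction=fraction, seed=42))

# Step 5: train a learner (here GBT) on the selected features; the
# fitted model can then be applied to a separate test set (step 6).
assembler_k = VectorAssembler(inputCols=top_k, outputCol="topFeatures")
gbt = GBTClassifier(labelCol="label", featuresCol="topFeatures")
gbt_model = gbt.fit(assembler_k.transform(rus_df))
```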
