Imbalanced Big Data Research Articles

High class imbalance between the majority and minority classes of a dataset can skew the performance of Machine Learning algorithms and bias predictions in favor of the majority (negative) class. This bias can have adverse consequences when the minority (positive) class is of greater interest and false negatives are costlier than false positives. Our paper presents two case studies, each combining Random Undersampling with Feature Selection to investigate the effect of class imbalance on big data analytics. Random Undersampling is used to generate six class distributions ranging from balanced to moderately imbalanced, and Feature Importance is used as our Feature Selection method. Classification performance is reported for the Random Forest, Gradient-Boosted Trees, and Logistic Regression learners, as implemented within the Apache Spark framework. The first case study uses a training dataset and a test dataset from the ECBDL’14 bioinformatics competition, containing about 32 million and 2.9 million instances, respectively. In this case study, Gradient-Boosted Trees obtained the best results, with either a feature set of 60 or the full feature set, and a negative-to-positive ratio of either 45:55 or 40:60. The second case study, unlike the first, takes its training data from one source (the POST dataset) and its test data from a separate source (the Slowloris dataset), where POST and Slowloris are two types of Denial of Service attacks. The POST dataset contains about 1.7 million instances, while the Slowloris dataset contains about 0.2 million instances. In this case study, Logistic Regression obtained the best results, with a feature set of 5 and any of the following negative-to-positive ratios: 40:60, 45:55, 50:50, 65:35, and 75:25. We conclude that combining Feature Selection with Random Undersampling improves the classification performance of learners with imbalanced big data from different application domains.
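To make the negative-to-positive ratios in the abstract above concrete, here is a minimal sketch of random undersampling to a target class distribution. It is not the authors' Spark implementation: the function name `random_undersample`, its parameters, and the toy labels are assumptions made for illustration only. The sketch keeps every positive instance and draws a random subset of negatives so that negatives:positives approximates the requested ratio.

```python
import random

def random_undersample(labels, neg_pct, pos_pct, seed=0):
    """Return sorted indices that keep all positives (label 1) and a random
    subset of negatives (label 0) so that negatives:positives is roughly
    neg_pct:pos_pct. Hypothetical helper, not taken from the paper."""
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    # Number of negatives needed to reach the requested ratio.
    target_neg = round(len(pos_idx) * neg_pct / pos_pct)
    # Undersampling never adds instances, so cap at what is available.
    target_neg = min(target_neg, len(neg_idx))
    rng = random.Random(seed)
    kept_neg = rng.sample(neg_idx, target_neg)  # sampling without replacement
    return sorted(pos_idx + kept_neg)

# Toy labels (made up for this example): 20 positives, 980 negatives.
labels = [1] * 20 + [0] * 980
kept = random_undersample(labels, neg_pct=40, pos_pct=60)
n_pos = sum(labels[i] for i in kept)
n_neg = len(kept) - n_pos
print(n_pos, n_neg)  # 20 positives, 13 negatives (a 40:60 split after rounding)
```

Because only the majority class is reduced, the minority class size fixes the size of the resulting training set; with very few positives, a balanced 50:50 distribution discards most of the data.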

Big data now plays a substantial part in knowledge analysis, manipulation, and forecasting. Analyzing and extracting knowledge from such large datasets is very challenging because of imbalanced data distributions, which can lead to biased classification results and wrong decisions. Standard classifiers are not capable of handling such datasets, so a new technique is required. This paper proposes a novel classification framework for big data that consists of three phases. The first is the feature selection phase, which uses the Whale optimization algorithm (WOA) to find the best set of features. The second is the preprocessing phase, which uses the SMOTE and LSH-SMOTE algorithms to address the class imbalance problem. The third is the WOA + BRNN algorithm, which uses the Whale optimization algorithm to train a deep learning model, the bidirectional recurrent neural network (BRNN), for the first time. We first tested WOA + BRNN on nine highly imbalanced datasets, one of which is a big dataset, comparing area under the curve (AUC) against four of the most commonly used machine learning algorithms (Naive Bayes, AdaBoostM1, decision table, random tree) and against GWO-MLP (a multilayer perceptron trained with the Gray Wolf Optimizer). We then tested it on four well-known datasets, comparing classification accuracy against GWO-MLP, particle swarm optimization (PSO-MLP), genetic algorithm (GA-MLP), ant colony optimization (ACO-MLP), evolution strategy (ES-MLP), and population-based incremental learning (PBIL-MLP). Experimental results show that WOA + BRNN achieved promising accuracy and high local optima avoidance, and outperformed the four commonly used machine learning algorithms and GWO-MLP in terms of AUC.
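The preprocessing phase described above relies on SMOTE, which creates synthetic minority instances by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal sketch of that interpolation step only, under stated assumptions: the function name `smote_sketch`, its parameters, and the toy points are hypothetical, and the neighbour search is brute force rather than the locality-sensitive hashing that LSH-SMOTE uses. It is not the paper's implementation.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    randomly chosen minority point and one of its k nearest minority
    neighbours. Brute-force neighbour search; illustration only."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to nb
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic

# Toy minority-class points (made up for this example).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=5)
print(len(new_points))  # 5
```

Each synthetic point is a convex combination of two existing minority points, so it stays inside the region the minority class already occupies. The brute-force search costs a full pass over the minority class per synthetic point, which is the step an approximate neighbour search is meant to speed up on large datasets.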

Articles published on Imbalanced Big Data

Parallel computing of fuzzy integrals: Performance and test

HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition

Experimental evaluation of ensemble classifiers for imbalance in Big Data

A Novel Hybrid Sampling Algorithm for Solving Class Imbalance Problem in Big Data

An adaptive synthesis to handle imbalanced big data with deep siamese network for electricity theft detection in smart grids

Distributed classification for imbalanced big data in distributed environments

Multi-class imbalanced big data classification on Spark

Gaussian Discriminative Analysis aided GAN for imbalanced big data augmentation and fault classification

The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data

Investigating the relationship between time and predictive model maintenance

Improved Inference and Prediction for Imbalanced Binary Big Data Using Case-Control Sampling: A Case Study on Deforestation in the Amazon Region

PSU: Particle Stacking Undersampling Method for Highly Imbalanced Big Data

Classification of Imbalanced Big Data using SMOTE with Rough Random Forest

A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem

Imbalanced Big Data Classification using Feature Selection Under-Sampling

Severely imbalanced Big Data challenges: investigating data sampling approaches

Examining characteristics of predictive models with imbalanced big data

Empirical Evaluation of Map Reduce Based Hybrid Approach for Problem of Imbalanced Classification in Big Data

Benchmarking framework for class imbalance problem using novel sampling approach for big data

WOA + BRNN: An imbalanced big data classification framework using Whale optimization and deep neural network

