Abstract

Big data is central to many current technologies, such as social media, smart cities, and the internet of things. Most standard classifiers tend to become trapped in local optima when dealing with such massive data sets, so new techniques for handling them are required. This paper presents a novel imbalanced big data mining framework that improves the performance of optimization algorithms by eliminating the local optima problem. The framework consists of three main stages. First, the preprocessing stage uses the LSH-SMOTE algorithm to solve the class imbalance problem, and then uses the LSH algorithm to hash the data set instances into buckets. Second, the bucket search stage uses the grey wolf optimizer (GWO) to train a bidirectional recurrent neural network (BRNN) and to search for the global optimum in each bucket. Last, the final tournament winner stage uses GWO+BRNN to find the global optimum of the whole data set among the global optima from all buckets. The proposed framework, LSHGWOBRNN, was tested over nine data sets, one of which is a big data set, in terms of AUC and MSE against seven well-known machine learning algorithms (Naive Bayes, Random Tree, Decision Table, AdaBoostM1, WOA+MLP, GWO+MLP, and WOA+BRNN). It was then tested over four well-known data sets against GWO+MLP, ACO+MLP, GA+MLP, PSO+MLP, PBIL+MLP, and ES+MLP in terms of classification accuracy and MSE. The experimental results show that LSHGWOBRNN provides high local optima avoidance, higher accuracy, and lower complexity and overhead.
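The three stages can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes random-hyperplane LSH for bucketing, replaces the BRNN with a plain linear model so the example stays short, uses a basic grey wolf optimizer, and reduces the final tournament to picking the bucket candidate with the lowest MSE on the whole data set. The SMOTE oversampling step is omitted. All function names and parameters (`lsh_buckets`, `gwo_minimize`, `bucket_then_tournament`, `n_planes`, `bound`) are hypothetical.

```python
import random


def lsh_buckets(rows, n_planes=2, seed=0):
    """Group row indices by the sign pattern of random hyperplane projections."""
    rng = random.Random(seed)
    dim = len(rows[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for idx, x in enumerate(rows):
        key = tuple(int(sum(p * v for p, v in zip(plane, x)) >= 0.0) for plane in planes)
        buckets.setdefault(key, []).append(idx)
    return buckets


def mse(weights, rows, targets):
    """Mean squared error of a linear model; the last weight is the bias."""
    total = 0.0
    for x, y in zip(rows, targets):
        pred = sum(w * v for w, v in zip(weights, x)) + weights[-1]
        total += (pred - y) ** 2
    return total / len(rows)


def gwo_minimize(loss, dim, n_wolves=10, n_iter=40, bound=5.0, seed=0):
    """Basic grey wolf optimizer: wolves move toward the three best positions."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(n_wolves)]
    best, best_loss = None, float("inf")
    for t in range(n_iter + 1):
        ranked = sorted(wolves, key=loss)
        if loss(ranked[0]) < best_loss:
            best, best_loss = list(ranked[0]), loss(ranked[0])
        if t == n_iter:
            break
        leaders = ranked[:3]  # alpha, beta, delta wolves
        a = 2.0 * (1.0 - t / n_iter)  # decreases linearly from 2 toward 0
        moved = []
        for w in wolves:
            pos = []
            for d in range(dim):
                estimate = 0.0
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    estimate += leader[d] - A * abs(C * leader[d] - w[d])
                # average the three leader-guided estimates, clipped to the search box
                pos.append(max(-bound, min(bound, estimate / 3.0)))
            moved.append(pos)
        wolves = moved
    return best, best_loss


def bucket_then_tournament(rows, targets, seed=0):
    """Search each LSH bucket separately, then keep the candidate that is best overall."""
    candidates = []
    for k, idxs in enumerate(lsh_buckets(rows, seed=seed).values()):
        sub_rows = [rows[i] for i in idxs]
        sub_targets = [targets[i] for i in idxs]
        weights, _ = gwo_minimize(lambda w: mse(w, sub_rows, sub_targets),
                                  dim=len(rows[0]) + 1, seed=seed + k)
        candidates.append(weights)
    winner = min(candidates, key=lambda w: mse(w, rows, targets))
    return winner, mse(winner, rows, targets)


rows = [[1.0, 2.0], [-1.0, 0.5], [0.3, -2.0], [-0.7, -1.1], [2.0, 0.1], [-2.0, 1.5]]
targets = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
winner, err = bucket_then_tournament(rows, targets)
print("winner weights:", winner, "MSE on full data:", err)
```

Searching each bucket independently gives several starting regions, which is the intuition behind the local optima avoidance claim. The stand-in tournament here is a simple comparison, whereas the framework described above runs GWO+BRNN again in the final stage.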

Highlights

  • The rapid growth of smart devices, the internet of things, smart cities, and massive numbers of sensor networks is flooding the world with a gigantic amount of data generated from numerous sources, such as social networks, sensor networks, video broadcasting sites, bioinformatics, internet marketing, and more

  • Our proposed framework LSHGWOBRNN will be tested against seven classifiers: Naive Bayes, AdaBoostM1, Decision Table, and Random Tree, in addition to Grey Wolf Optimizer (GWO)+Multilayer Perceptron (MLP), published in 2015 [62], WOA+MLP, and WOA+Bidirectional Recurrent Neural Network (BRNN) [71]. The tests will be performed over eight highly imbalanced data sets obtained from the KEEL Data Set Repository (imbalance ratio higher than 9) [63] and one big data set that was used in the ECBDL 14 Big Data Mining Competition 2014 [64]

  • First experiment: the proposed framework LSHGWOBRNN will be tested against seven classifiers [71] over nine highly imbalanced data sets in two sub-experiments, without preprocessing and with Locality Sensitive Hashing (LSH)-SMOTE preprocessing, in terms of area under the ROC curve (AUC) and mean square error (MSE) (local optima avoidance)
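The quantities named in the highlights can be computed as in the following sketch. It assumes the usual definitions: imbalance ratio as the majority class count divided by the minority class count, AUC as the fraction of positive-negative pairs that the scores rank correctly (ties count one half), and MSE as the average squared difference between targets and outputs. The function names are hypothetical and the data is made up for illustration.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority class count divided by minority class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def auc(labels, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


def mean_square_error(targets, outputs):
    """Average squared difference between targets and model outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


print(imbalance_ratio([0] * 9 + [1]))                      # 9.0
print(auc([1, 1, 0, 0, 0], [0.9, 0.4, 0.1, 0.5, 0.3]))     # 5 of 6 pairs ranked correctly
print(mean_square_error([1, 0], [0.5, 0.5]))               # 0.25
```

The pairwise AUC is quadratic in the number of instances, so it is suitable only for small illustrations; a rank-based computation would be preferable on large data sets.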


Summary

Introduction

The rapid growth of smart devices, the internet of things, smart cities, and massive numbers of sensor networks is flooding the world with a gigantic amount of data generated from numerous sources, such as social networks, sensor networks, video broadcasting sites, bioinformatics, internet marketing, and more. Extracting knowledge from such vast data sets is considered one of the biggest challenges for most traditional machine learning techniques [1]. A classifier may perform very well on the majority class yet very poorly on the minority class, because traditional techniques assume a balanced data distribution.
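A small worked example makes the majority/minority gap concrete. The sketch below uses made-up labels with a 95:5 class split and a trivial classifier that always predicts the majority class; the helper names are hypothetical.

```python
def accuracy(labels, preds):
    """Fraction of instances predicted correctly."""
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)


def recall(labels, preds, cls):
    """Fraction of instances of class `cls` that were predicted as `cls`."""
    relevant = [p for y, p in zip(labels, preds) if y == cls]
    return sum(p == cls for p in relevant) / len(relevant)


labels = [0] * 95 + [1] * 5   # 95 majority instances, 5 minority instances
preds = [0] * 100             # always predict the majority class

print(accuracy(labels, preds))    # 0.95
print(recall(labels, preds, 0))   # 1.0
print(recall(labels, preds, 1))   # 0.0
```

Overall accuracy looks high even though no minority instance is recognized, which is why imbalance-aware measures such as AUC are used for this kind of data.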

