Abstract

Every day we experience unprecedented data growth from numerous sources, which contribute to big data in terms of volume, velocity, and variability. These datasets pose great challenges to analytics frameworks and computational resources, making it difficult to extract meaningful information in a timely manner. Developing an efficient big data analytics framework to address these challenges is therefore an important research topic. Machine learning (ML) and deep learning (DL) algorithms are increasingly used in such frameworks to exploit non-linear relationships in very large, high-dimensional datasets. Apache Spark has emerged as one of the fastest big data processing engines and supports iterative ML tasks through its distributed ML library, Spark MLlib. For real-world research problems, DL architectures such as Long Short-Term Memory (LSTM) networks are effective at overcoming practical issues of conventional deep architectures, including reduced accuracy, long-term sequence dependencies, and vanishing and exploding gradients. In this paper, we propose an efficient analytics framework: a progressive machine learning technique that merges Spark-based linear models, a Multilayer Perceptron (MLP), and an LSTM in a two-stage cascade structure to enhance predictive accuracy. The proposed architecture enables us to organize big data analytics in a scalable and efficient way. To show the effectiveness of our framework, we applied the cascading structure to two different real-life datasets, solving a multiclass and a binary classification problem, respectively. Experimental results show that our framework outperforms state-of-the-art approaches with a high level of classification accuracy.
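The abstract does not spell out how the two cascade stages are coupled, so the sketch below shows one plausible reading only: a Spark MLlib linear model produces class probabilities in stage 1, and these are appended to the original features and fed to a Keras LSTM in stage 2. The file name data.csv, the label column, the length-1 sequence windowing, and the choice of logistic regression as the stage-1 model are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

spark = SparkSession.builder.appName("two-stage-cascade").getOrCreate()

# Stage 1: a Spark MLlib linear model.
# 'data.csv' and the 'label' column are placeholders; all other columns are assumed numeric.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != "label"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=50)
stage1 = lr.fit(train)
scored = stage1.transform(train)  # adds 'probability' and 'prediction' columns

# Stage 2: original features plus stage-1 class probabilities go into an LSTM.
# Treating each sample as a length-1 sequence is an assumption made for this sketch;
# a held-out split would be scored separately for evaluation.
pdf = scored.select("features", "probability", "label").toPandas()
X = np.array([f.toArray().tolist() + p.toArray().tolist()
              for f, p in zip(pdf["features"], pdf["probability"])])
X = X.reshape((X.shape[0], 1, X.shape[1]))   # (samples, timesteps, features)
y = pdf["label"].to_numpy()

lstm = Sequential([
    LSTM(64, input_shape=(1, X.shape[2])),
    Dense(1, activation="sigmoid"),          # binary case; use softmax for multiclass
])
lstm.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
lstm.fit(X, y, epochs=10, batch_size=32)
```

For the multiclass (arrhythmia) case, the final Dense layer would use a softmax output with categorical cross-entropy instead of the sigmoid shown here.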

Highlights

  • Every day we experience unprecedented data growth from numerous sources, which contribute to big data in terms of volume, velocity, and variability

  • We provide an overall perspective by evaluating several significant concepts and previous work in the big data, Spark, machine learning (ML), deep learning (DL) and cascade learning (CL) domains

  • The literature review related to this article is discussed in four categories, i.e., related work on Spark, ML, DL with Multilayer Perceptron (MLP) and Long Short-Term Memory (LSTM), and CL

Introduction

Every day we experience unprecedented data growth from numerous sources, which contribute to big data in terms of volume, velocity, and variability. Big data infrastructures have been developed to support analytics with fast, reliable, and versatile computational designs, providing quality attributes such as flexibility, accessibility, and on-demand, easy-to-use resource pooling [3,4]. This steadily growing requirement plays a vital role in improving large-scale industrial data analytics frameworks. This article examines a more proficient framework for massive data processing, Apache Spark, a big data processing tool for distributed computing that is well suited to iterative machine learning (ML).
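As a concrete illustration of iterative ML on Spark, the short PySpark sketch below trains Spark MLlib's Multilayer Perceptron classifier on the multiclass sample data shipped with the Spark distribution; the data path and the layer sizes are illustrative and would differ for the datasets used later in the paper.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("spark-mlp-demo").getOrCreate()

# Sample multiclass data in LIBSVM format (ships with the Spark distribution;
# the path depends on where Spark is installed).
data = spark.read.format("libsvm") \
    .load("data/mllib/sample_multiclass_classification_data.txt")
train, test = data.randomSplit([0.7, 0.3], seed=1234)

# layers = [input size, hidden layer size, number of classes]; values are dataset-dependent
mlp = MultilayerPerceptronClassifier(layers=[4, 8, 3], maxIter=100, seed=1234)
model = mlp.fit(train)

preds = model.transform(test)
accuracy = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(preds)
print(f"Test accuracy: {accuracy:.3f}")

spark.stop()
```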

Outline

  • Background and Related Work
  • Proposed
  • Overview of the Architecture
  • Support
  • Computation Time
  • Continuous Learning Improvement
  • Proposed Framework Implementation
  • Description of the Dataset
  • Cardiac Arrhythmia Classification
  • Recurrent
  • Identifying Malicious URLs
  • Experimental Setup
  • Stage 1 Classification Analysis
  • Stage 2 Classification Analysis
  • Method
  • Conclusions and Outlook