Abstract

The goal of big data analytics is to analyze datasets of high volume, velocity, and variety for large-scale business intelligence problems. Such workloads are normally processed in a distributed fashion on massively parallel analytical systems. Deep learning is part of a broader family of machine learning methods based on learning representations of data, and it plays a significant role in information analysis by adding value to massive amounts of unsupervised data. A core research domain is the development of deep learning algorithms that automatically extract complex data representations at high levels of abstraction from massive volumes of data. In this paper, we present the latest research trends in parallel algorithms, optimization techniques, tools, and libraries for big data analytics and deep learning on various parallel architectures. The basic building blocks of deep learning, such as Restricted Boltzmann Machines (RBMs) and Deep Belief Networks (DBNs), are identified and analyzed for parallelization of deep learning models. We propose a parallel software API based on PyTorch, the Hadoop Distributed File System (HDFS), Apache Hadoop MapReduce, and MapReduce Job (MRJob) for developing large-scale deep learning models. We obtained about a 5-30% reduction in the execution time of a deep auto-encoder model, even on a single-node Hadoop cluster. Furthermore, the complexity of code development for multi-layer deep learning models is significantly reduced.
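The abstract describes training deep learning models in the MapReduce style: each map task computes on its own partition of the input samples, and a reduce step combines the partial results. The sketch below is a minimal, self-contained illustration of that data-parallel pattern (per-partition gradients averaged in a reduce step) for a toy one-parameter model; it is not the paper's actual API, and the function names are our own.

```python
# Illustrative sketch (not the paper's API): data-parallel training in
# the MapReduce style. Each "map" computes a gradient on one partition
# of the samples; a "reduce" averages the partial gradients.

def grad_partition(w, partition):
    """Map step: mean-squared-error gradient of y = w*x on one partition."""
    g = 0.0
    for x, y in partition:
        g += 2 * (w * x - y) * x
    return g / len(partition)

def reduce_grads(grads):
    """Reduce step: average the partial gradients from all partitions."""
    return sum(grads) / len(grads)

def train(partitions, w=0.0, lr=0.1, steps=50):
    """Gradient descent where every step is a map phase plus a reduce phase."""
    for _ in range(steps):
        grads = [grad_partition(w, p) for p in partitions]  # map phase
        w -= lr * reduce_grads(grads)                       # reduce phase
    return w

# Data generated from y = 3x, split across two simulated "nodes".
parts = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(parts)  # converges toward w = 3.0
```

In a real deployment, the partitions would be HDFS input splits and the map/reduce phases would run as Hadoop or MRJob tasks; the averaging step is the same idea used to synchronize gradients in data-parallel deep learning.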

Highlights

  • Big volumes of data have started to accumulate, driven by advances in sensor technology, the Internet, social networks, wireless communication, and inexpensive memory, in various formats such as numerical, textual, and image data

  • We explored several parallel algorithms, optimization techniques, tools and libraries related to big data analytics and deep learning on various parallel architectures

  • In order to utilize these frameworks and libraries for developing large-scale deep learning models for big data analytics, they need to be extended to execute on multiple computing nodes, where each node holds a portion of the input samples and runs the model in parallel
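The last highlight hinges on partitioning the input samples across nodes and running the same computation on each partition in parallel. The sketch below emulates that with threads on a single machine, purely to make the sharding-and-combine idea concrete; the function names are our own, and on a real cluster the partitions would be HDFS splits processed by separate worker nodes.

```python
# Illustrative sketch: shard input samples across "nodes" and run the
# same partial computation on each shard in parallel, then combine.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    """Per-node work: a partial statistic over one shard of the samples."""
    return sum(partition), len(partition)

def distributed_mean(samples, n_nodes=4):
    """Round-robin shard the samples, process shards in parallel, combine."""
    parts = [samples[i::n_nodes] for i in range(n_nodes)]
    with ThreadPoolExecutor(max_workers=n_nodes) as ex:
        results = list(ex.map(partial_sum, parts))
    total = sum(s for s, _ in results)
    count = sum(n for _, n in results)
    return total / count

m = distributed_mean(list(range(10)))  # mean of 0..9
```

The same shard/compute/combine structure carries over to deep learning, where the per-shard computation is a forward/backward pass and the combine step aggregates gradients or model updates.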

Summary

INTRODUCTION

Big volumes of data have started to accumulate, driven by advances in sensor technology, the Internet, social networks, wireless communication, and inexpensive memory, in formats such as numerical, textual, and image data. Such high-volume data can be analyzed using statistical and Computational Intelligence (CI) tools based on neuro-computing, fuzzy logic, clustering, Bayesian networks, Principal Component Analysis (PCA), etc. Deep learning is an active research area in both industry and academia, addressing practical problems such as image and speech recognition, neural machine translation, traffic management, and cancer detection. It has been successfully applied to task classification, object detection, motion modeling, dimensionality reduction, and network flow prediction [3].

LITERATURE REVIEW
TensorFlow
PyTorch
Caffe2
Comparison of Deep Learning Frameworks
Customize Code Optimizations of Deep Learning Algorithms
BUILDING BLOCKS FOR DEEP LEARNING IN BIG DATA ANALYTICS
PROPOSED SOFTWARE ABSTRACTIONS FOR DEEP LEARNING MODELS
API Process Flow
API Usage
API EVALUATIONS IN TERMS OF PERFORMANCE AND USAGE
Findings
CONCLUSION AND FUTURE WORK