Abstract

Apache Spark is one of the most widely used open source processing frameworks for big data: it allows large datasets to be processed in parallel across a large number of nodes. Applications built on this framework often rely on resource management systems such as YARN, which grant jobs a specific amount of resources for their execution, while a distributed file system such as HDFS stores the data to be analyzed. This design allows cluster resources to be shared effectively by running jobs on single-node or multi-node cluster infrastructures. One challenging issue is therefore to manage the resources of these large cluster infrastructures effectively, so that distributed data analytics can run in an economically viable way. In this study, we use Spark's Machine Learning library (MLlib) to implement different machine learning algorithms, and we manage the resources (CPU, memory, and disk) in order to assess the performance of Apache Spark. We first present a review of various works that focus on resource management and data processing in Big Data platforms. We then perform a scalability analysis using Spark, analyzing speedup and processing time, and we find that beyond a certain number of nodes in the cluster, adding further nodes no longer improves speedup or processing time. Next, we investigate the tuning of resource allocation in Spark and show that allocating all available resources does not by itself yield better performance: performance depends on how the resource allocation is tuned. We propose new managed parameters and show that they give a better total processing time than the default parameters used by Spark. Finally, we study the persistence of Resilient Distributed Datasets (RDDs) in Spark using machine learning algorithms and show that one storage level gives the best execution time among all tested storage levels.
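
As a minimal sketch of the persistence experiment described above, the following Scala snippet trains MLlib's RDD-based KMeans on data cached with an explicit storage level. The input path, the choice of KMeans, and the parameter values are illustrative assumptions, not the paper's exact setup; the point is that swapping the StorageLevel changes how the iterative algorithm re-reads its input.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object KMeansPersistSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kmeans-persist-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Hypothetical input: one whitespace-separated feature vector per line.
        val points = sc.textFile("hdfs:///data/features.txt")
          .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))
          .persist(StorageLevel.MEMORY_ONLY) // the storage level under test

        // KMeans is iterative and re-reads `points` on every iteration, so the
        // chosen storage level directly affects execution time.
        val model = KMeans.train(points, 3, 20) // k = 3, maxIterations = 20
        println(s"Computed ${model.clusterCenters.length} cluster centers")

        spark.stop()
      }
    }

Replacing MEMORY_ONLY with, for example, MEMORY_AND_DISK or DISK_ONLY and timing the run is one way to compare storage levels as the study does.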

Highlights

  • Many applications generate and handle very large volumes of data, such as social networking, cloud applications, public web sites, search engines, scientific simulations, data warehouses, and so on

  • We evaluated the Machine Learning library (MLlib) and Mahout through several experiments, increasing the data size up to 10 GB, in order to compare the behavior of MLlib and Mahout according to data size on three different algorithms

  • We present some recommendations for tuning; for a specific configuration, we propose new managed parameters and show that these managed parameters give a better total processing time than the default parameters (a configuration sketch follows this list)
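
As a hedged illustration of how such resource parameters are set in Spark, the snippet below configures executor memory, cores, and instance count programmatically. The values shown are placeholders for illustration, not the managed parameters proposed in the study.

    import org.apache.spark.sql.SparkSession

    // Placeholder values for illustration only; the study's tuned parameters
    // depend on the cluster configuration under test.
    val spark = SparkSession.builder()
      .appName("resource-tuning-sketch")
      .config("spark.executor.memory", "4g")    // memory per executor
      .config("spark.executor.cores", "2")      // CPU cores per executor
      .config("spark.executor.instances", "4")  // number of executors (YARN)
      .getOrCreate()

On a YARN cluster the same settings are typically passed on the command line via spark-submit's --executor-memory, --executor-cores, and --num-executors flags.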


Introduction

Many applications generate and handle very large volumes of data, such as social networking, cloud applications, public web sites, search engines, scientific simulations, data warehouses, and so on. Apache Spark is an open source Big Data processing framework designed for fast computation and ease of use. It is based on the MapReduce paradigm and extends it considerably. Apache Spark provides APIs (application programming interfaces) in several programming languages, including Scala, Java, Python, and R [31]. The framework lets developers express data transformations and machine learning algorithms as parallel operations, as the sketch below illustrates.
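
As a minimal, self-contained example of such parallel transformations, here is a classic word count in Scala; the input path is a hypothetical placeholder.

    import org.apache.spark.sql.SparkSession

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("wordcount-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Transformations (flatMap, map, reduceByKey) are declared lazily and
        // executed in parallel across the cluster when an action runs.
        val counts = sc.textFile("hdfs:///data/input.txt") // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println) // action: triggers the computation
        spark.stop()
      }
    }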

