Abstract

Apache Spark is one of the most widely used open source processing frameworks for big data: it allows large datasets to be processed in parallel across a large number of nodes. Applications built on this framework often rely on resource management systems such as YARN, which grant jobs a specific amount of resources for their execution, while a distributed file system such as HDFS stores the data to be analyzed. This design allows cluster resources to be shared effectively by running jobs on single-node or multi-node cluster infrastructures. One challenging issue is therefore to manage the resources of these large cluster infrastructures effectively, so that distributed data analytics can run in an economically viable way. In this study, we use Spark's Machine Learning library (MLlib) to implement different machine learning algorithms, and we manage the resources (CPU, memory, and disk) in order to assess the performance of Apache Spark. We first present a review of various works that focus on resource management and data processing in Big Data platforms. We then perform a scalability analysis using Spark, analyzing speedup and processing time, and we find that beyond a certain number of nodes in the cluster, adding further nodes no longer improves speedup or processing time. Next, we investigate the tuning of resource allocation in Spark and show that allocating all available resources does not by itself yield better performance: performance depends on how the resource allocation is tuned. We propose new managed parameters and show that they give a better total processing time than the default parameters used by Spark. Finally, we study the persistence of Resilient Distributed Datasets (RDDs) in Spark using machine learning algorithms and show that one storage level gives the best execution time among all tested storage levels.
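
As a minimal sketch of the persistence experiment described above, the following Scala snippet trains MLlib's RDD-based KMeans on data cached with an explicit storage level. The input path, the choice of KMeans, and the parameter values are illustrative assumptions, not the paper's exact setup; the point is that swapping the StorageLevel changes how the iterative algorithm re-reads its input.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object KMeansPersistSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kmeans-persist-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Hypothetical input: one whitespace-separated feature vector per line.
        val points = sc.textFile("hdfs:///data/features.txt")
          .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))
          .persist(StorageLevel.MEMORY_ONLY) // the storage level under test

        // KMeans is iterative and re-reads `points` on every iteration, so the
        // chosen storage level directly affects execution time.
        val model = KMeans.train(points, 3, 20) // k = 3, maxIterations = 20
        println(s"Computed ${model.clusterCenters.length} cluster centers")

        spark.stop()
      }
    }

Replacing MEMORY_ONLY with, for example, MEMORY_AND_DISK or DISK_ONLY and timing the run is one way to compare storage levels as the study does.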

Highlights

  • Many applications generate and handle very large volumes of data, such as social networking, cloud applications, public web sites, search engines, scientific simulations, data warehouses, and so on

  • We evaluated the Machine Learning library (MLlib) and Mahout through several experiments, increasing the data size up to 10 GB, in order to compare the behavior of MLlib and Mahout according to data size on three different algorithms

  • We present some recommendations for tuning; for a specific configuration, we propose new managed parameters and show that these managed parameters give a better total processing time than the default parameters (a configuration sketch follows this list)
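
As a hedged illustration of how such resource parameters are set in Spark, the snippet below configures executor memory, cores, and instance count programmatically. The values shown are placeholders for illustration, not the managed parameters proposed in the study.

    import org.apache.spark.sql.SparkSession

    // Placeholder values for illustration only; the study's tuned parameters
    // depend on the cluster configuration under test.
    val spark = SparkSession.builder()
      .appName("resource-tuning-sketch")
      .config("spark.executor.memory", "4g")    // memory per executor
      .config("spark.executor.cores", "2")      // CPU cores per executor
      .config("spark.executor.instances", "4")  // number of executors (YARN)
      .getOrCreate()

On a YARN cluster the same settings are typically passed on the command line via spark-submit's --executor-memory, --executor-cores, and --num-executors flags.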


Introduction

Many applications generate and handle very large volumes of data, such as social networking, cloud applications, public web sites, search engines, scientific simulations, data warehouses, and so on. Apache Spark is an open source Big Data processing framework designed for fast computation and ease of use. It is based on the MapReduce paradigm and extends it considerably. Apache Spark provides APIs (application programming interfaces) in several programming languages, including Scala, Java, Python, and R [31]. The framework lets developers express data transformations and machine learning algorithms as parallel operations, as the sketch below illustrates.
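
As a minimal, self-contained example of such parallel transformations, here is a classic word count in Scala; the input path is a hypothetical placeholder.

    import org.apache.spark.sql.SparkSession

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("wordcount-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Transformations (flatMap, map, reduceByKey) are declared lazily and
        // executed in parallel across the cluster when an action runs.
        val counts = sc.textFile("hdfs:///data/input.txt") // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println) // action: triggers the computation
        spark.stop()
      }
    }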

