Spark Performance Optimization Analysis With Multi-Layer Parameter Using Shuffling and Scheduling With Data Serialization in Different Data Caching Options

Mesay Deleli, Deleli Mesay Adinew, Ayall Tewodros Alemu

doi:10.4018/jta.290326

Abstract

As social networking services and e-commerce grow rapidly, the number of online users is also growing dynamically, which drives the contribution of huge volumes of content to the digital world. In such a dynamic environment, meeting the demand for computing is very challenging, especially with existing computing models. Although Spark was recently introduced to alleviate these problems through in-memory computing for big data analytics, and exposes many configuration parameters that can be tuned to improve its performance, it still has performance bottlenecks. This calls for investigating performance improvement mechanisms, focusing on combinations of scheduling and shuffle manager with data serialization under different intermediate data caching options. A standalone cluster computing model was selected as the experimental setup, using the submit command line for data submission. Three Spark applications, WordCount, TeraSort, and PageRank, were selected and developed for the experiments. As a result, performance improvements of 2.45% and 8.01% were achieved with the OFFHEAP and Memory Only Ser data caching options, respectively.
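The parameter combinations described in the abstract can be pictured as a grid of spark-submit invocations. The following is a minimal sketch under stated assumptions, not the authors' actual harness: `spark.scheduler.mode`, `spark.shuffle.manager`, and `spark.serializer` are real Spark configuration keys, but the value lists, the `build_commands` helper, the master URL, the jar name, the class name, and the convention of passing the storage level as an application argument are all hypothetical.

```python
from itertools import product

# Value lists are assumptions about which settings might be compared.
SCHEDULERS = ["FIFO", "FAIR"]                 # values for spark.scheduler.mode
SHUFFLE_MANAGERS = ["sort", "tungsten-sort"]  # values for spark.shuffle.manager
SERIALIZERS = [                               # values for spark.serializer
    "org.apache.spark.serializer.JavaSerializer",
    "org.apache.spark.serializer.KryoSerializer",
]
# Storage levels are chosen inside the application (for example via persist),
# so this sketch passes the level name as a hypothetical application argument.
STORAGE_LEVELS = ["OFF_HEAP", "MEMORY_ONLY_SER", "DISK_ONLY"]


def build_commands(master_url, app_jar, main_class):
    """Return one spark-submit argument list per parameter combination."""
    commands = []
    for mode, shuffle, serializer, level in product(
        SCHEDULERS, SHUFFLE_MANAGERS, SERIALIZERS, STORAGE_LEVELS
    ):
        commands.append([
            "spark-submit",
            "--master", master_url,
            "--class", main_class,
            "--conf", f"spark.scheduler.mode={mode}",
            "--conf", f"spark.shuffle.manager={shuffle}",
            "--conf", f"spark.serializer={serializer}",
            app_jar,
            level,  # hypothetical: the application reads this to pick a storage level
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical cluster URL, jar, and class name, for illustration only.
    for cmd in build_commands("spark://master:7077", "benchmarks.jar", "example.TeraSort"):
        print(" ".join(cmd))
```

Each printed line is one candidate run; timing every run for every application would give the kind of comparison the abstract reports.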

Highlights

As the digital universe increases in size due to the daily transformation of everything related to people, enterprises, and environments into digital form, it becomes challenging to analyze (H. Zhang et al., 2018)
Although Spark was recently introduced to alleviate these problems through in-memory computing for big data analytics, with many configuration parameters that can be tuned to improve performance, it still has performance bottlenecks. This calls for investigating performance improvement mechanisms, focusing on the combination of scheduling and shuffle manager with data serialization under different intermediate data caching options
The experimental results observed in different algorithms are summarized as follows: 1. In the Spark Sort algorithm, the FIFO scheduler with the Sort shuffler and Java serialization under the OffHeap data caching option shows the best performance of all combinations, whereas the Disk Only data caching option performs better than the remaining combinations, following the OffHeap option

Summary

Introduction

As the digital universe increases in size due to the daily transformation of everything related to people, enterprises, and environments into digital form, it becomes challenging to analyze (H. Zhang et al., 2018). According to a study by IDC, the size of data in the digital universe was predicted to exceed around 44 zettabytes by 2020 (Aggarwal et al., 2014; Tsai et al., 2018). As the use of the Internet grows with the daily activities of users, data are growing daily and becoming available with an increasing variety of features. These multidimensional features complicate design and analysis and make performance behavior intricate. This kind of complex data requires big data analytics, which speeds up data analysis to reduce cost and support instant decisions.


Journal: Journal of Technological Advancements
Publication Date: Oct 21, 2021
License type: CC BY 3.0

