Abstract

As social networking services and e-commerce grow rapidly, the number of online users is also growing, and these users contribute huge volumes of content to the digital world. In such a dynamic environment, meeting computing demand is very challenging, especially with existing computing models. Spark was recently introduced to alleviate these problems through in-memory computing for big data analytics, and it exposes many configuration parameters for tuning its performance. It still has performance bottlenecks, however, which calls for investigating performance improvement mechanisms that focus on combinations of scheduling and shuffle manager with data serialization and intermediate data caching options. A standalone cluster was used as the experimental setup, with applications submitted through the submit command line. Three Spark applications, WordCount, TeraSort and PageRank, were selected and developed for the experiments. As a result, performance improvements of 2.45% and 8.01% were achieved with the OFFHEAP and Memory Only Ser data caching options, respectively.

Highlights

  • As the digital universe increases in size due to the daily transformation of everything related to people, enterprises, and environments into digital form, it becomes challenging to analyze (H. Zhang et al., 2018)

  • Spark was recently introduced to alleviate these problems through in-memory computing for big data analytics, with many configuration parameters for tuning performance. It still has a performance bottleneck, which requires investigating performance improvement mechanisms that focus on the combination of scheduling and shuffle manager with data serialization and intermediate data caching options

  • The experimental results observed in different algorithms are summarized as follows: 1. In the Spark Sort algorithm, the FIFO scheduler with the Sort shuffler and Java serialization under the OffHeap data caching option shows the best performance of all combinations, whereas the Disk Only data caching option performs better than the remaining combinations, second only to OffHeap
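
The parameter combinations described in these highlights can be pictured as a grid of submit commands, one per combination. The following is a minimal sketch in plain Python, not the authors' experimental harness. The property names `spark.scheduler.mode`, `spark.shuffle.manager` and `spark.serializer` are real Spark settings, but the value lists beyond those named in the text (FIFO, sort, Java serialization, OffHeap, Memory Only Ser, Disk Only) are assumptions. The master URL, jar name and class name are hypothetical, and passing the storage level as a trailing application argument is an illustrative convention, since Spark sets it in application code rather than through a property.

```python
from itertools import product

# Candidate values per dimension. FAIR, tungsten-sort and Kryo are
# assumptions added for illustration; the source names only FIFO,
# the Sort shuffler and Java serialization explicitly.
SCHEDULERS = ["FIFO", "FAIR"]
SHUFFLE_MANAGERS = ["sort", "tungsten-sort"]
SERIALIZERS = [
    "org.apache.spark.serializer.JavaSerializer",
    "org.apache.spark.serializer.KryoSerializer",
]
# Storage levels named in the source (OffHeap, Memory Only Ser, Disk Only).
STORAGE_LEVELS = ["OFF_HEAP", "MEMORY_ONLY_SER", "DISK_ONLY"]


def build_submit_command(master, app_jar, main_class,
                         scheduler, shuffle, serializer, storage_level):
    """Return one spark-submit invocation as an argument list."""
    return [
        "spark-submit",
        "--master", master,
        "--class", main_class,
        "--conf", f"spark.scheduler.mode={scheduler}",
        "--conf", f"spark.shuffle.manager={shuffle}",
        "--conf", f"spark.serializer={serializer}",
        app_jar,
        # Hypothetical convention: the application reads the storage
        # level from its first argument and passes it to persist().
        storage_level,
    ]


def all_commands(master, app_jar, main_class):
    """Enumerate every combination of the four dimensions."""
    grid = product(SCHEDULERS, SHUFFLE_MANAGERS, SERIALIZERS, STORAGE_LEVELS)
    return [build_submit_command(master, app_jar, main_class, *combo)
            for combo in grid]


if __name__ == "__main__":
    # Hypothetical standalone master, jar and class names.
    for cmd in all_commands("spark://master-host:7077",
                            "benchmarks.jar", "example.TeraSort"):
        print(" ".join(cmd))
```

Building the commands as argument lists rather than shell strings keeps each run reproducible and makes it easy to hand them to a process launcher and time each one.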


Summary

Introduction

As the digital universe increases in size due to the daily transformation of everything related to people, enterprises, and environments into digital form, it becomes challenging to analyze (H. Zhang et al., 2018). According to a study by IDC, the size of data in the digital universe was predicted to exceed about 44 zettabytes by 2020 (Aggarwal et al., 2014; Tsai et al., 2018). As Internet use grows with users' daily activities, data grow daily as well, and they become available with an increasing variety of features. These multidimensional features make design and analysis more complicated and performance more intricate. This kind of complex data requires big data analytics, which improves the performance of data analysis to reduce cost and support instant decisions.

Methods
Findings
Discussion
Conclusion
