Intermediate data placement and cache replacement strategy under Spark platform

Chunlin Li, Yong Zhang, Youlong Luo

doi:10.1016/j.jpdc.2022.01.020

Abstract

Spark is widely used for its high-performance caching mechanism and high scalability, but it still suffers from uneven workloads and produces useless cached intermediate results when faced with data-intensive applications. A data placement strategy based on an improved reservoir sampling algorithm is proposed to solve the problem of intermediate data skew in the shuffle stage of Spark. Unlike the traditional sampling algorithm, it accumulates the amount of intermediate data while sampling. A data skew measurement model classifies the data as skewed or non-skewed, and coarse-grained and fine-grained placement algorithms are designed. To further improve Spark's memory utilization and cache hit rate, an adaptive cache replacement algorithm is proposed to maximize cache gain. We analyze the dependencies between operations and propose a cache gain model. Unlike the traditional method, the cases of known and unknown job arrival rates are considered separately to obtain an online adaptive cache replacement strategy that maximizes cache gain. Experimental results show that our data placement strategy effectively reduces the execution time of Spark applications and improves the load balance of reduce tasks. Meanwhile, the proposed adaptive cache replacement strategy effectively reduces Spark's average completion time and improves memory utilization and the cache hit rate.
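The abstract does not spell out the improved sampling algorithm, so the following is only a minimal sketch of the general idea it describes: taking a reservoir sample of intermediate (key, value) records while accumulating the amount of data per key in the same pass. It uses the standard reservoir sampling scheme (Algorithm R), not the authors' improved variant. The function names, the `skew_factor` threshold, and the "count exceeds a multiple of the mean" rule are all assumptions made for illustration and are not taken from the paper's skew measurement model.

```python
import random
from collections import Counter


def reservoir_sample_with_counts(records, k, rng=None):
    """Standard reservoir sampling (Algorithm R) over (key, value) records.

    While sampling, also accumulate how many records each key has, so the
    per-key data volume is known after a single pass.
    """
    rng = rng or random.Random()
    reservoir = []
    key_counts = Counter()
    for i, record in enumerate(records):
        key_counts[record[0]] += 1      # accumulate volume for this key
        if i < k:
            reservoir.append(record)    # fill the reservoir first
        else:
            j = rng.randint(0, i)       # uniform over [0, i], inclusive
            if j < k:
                reservoir[j] = record   # replace with probability k / (i + 1)
    return reservoir, key_counts


def classify_keys(key_counts, skew_factor=2.0):
    """Hypothetical skew rule: a key is skewed if its record count exceeds
    skew_factor times the mean count per key (an assumption, not the paper's
    measurement model)."""
    if not key_counts:
        return set(), set()
    mean = sum(key_counts.values()) / len(key_counts)
    skewed = {key for key, count in key_counts.items() if count > skew_factor * mean}
    return skewed, set(key_counts) - skewed
```

In a sketch like this, the keys flagged as skewed would be candidates for a finer-grained placement across reduce tasks, while the remaining keys could be placed in coarser groups; how the paper actually assigns them is not stated in the abstract.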

Journal: Journal of Parallel and Distributed Computing
Publication Date: Jan 31, 2022
Citations: 23

Similar Papers

ODCP: Optimizing Data Caching and Placement in Distributed File System Using Erasure Coding
Shuhan Wu ... Yunchun Li
01 Jan 2020

A Survey on Data Placement Strategies for Cloud based Scientific Workflows
Lalitha Singh ... Jyoti Malhotra
International Journal of Computer Applications | VOL. 141
17 May 2016

Towards Intelligent Data Placement for Scientific Workflows in Collaborative Cloud Environment
Xin Liu ... Anwitaman Datta
01 May 2011

Dynamic data replication and placement strategy in geographically distributed data centers
Laila Bouhouch ... Claude Tadonki
Concurrency and Computation: Practice and Experience | VOL. 35
01 Feb 2022
