Abstract
Spark is widely used for its high-performance caching mechanism and high scalability, yet it still suffers from uneven workloads and produces useless intermediate cached results when faced with data-intensive applications. To address the skew of intermediate data in Spark's shuffle stage, a data placement strategy based on an improved reservoir sampling algorithm is proposed. Unlike the traditional sampling algorithm, the amount of intermediate data is accumulated while sampling. A data skew measurement model is used to classify data into skewed and non-skewed data, and coarse-grained and fine-grained placement algorithms are designed accordingly. To further improve Spark's memory utilization and cache hit rate, an adaptive cache replacement algorithm is proposed to maximize cache gain. We analyze the dependencies between operations and propose a cache gain model. Unlike traditional methods, the cases of known and unknown job arrival rates are considered separately to obtain an online adaptive cache replacement strategy that maximizes cache gain. Experimental results show that the proposed data placement strategy effectively reduces the execution time of Spark applications and improves the load balance of reduce tasks. Meanwhile, the proposed adaptive cache replacement strategy effectively reduces Spark's average job completion time and improves memory utilization and the cache hit rate.
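The core sampling idea can be illustrated with a minimal sketch: a standard reservoir sample of intermediate (key, size) records, extended so that the total data volume per key is accumulated during the same pass, which then feeds a simple skew classification. All names and the skew threshold here are illustrative assumptions, not taken from the paper.

```python
import random

def sample_with_accumulation(records, k, seed=0):
    """Reservoir-sample k (key, size) records while also accumulating
    the total size per key in the same pass (hypothetical sketch of
    sampling-with-accumulation; not the paper's exact algorithm)."""
    rng = random.Random(seed)
    reservoir = []
    totals = {}  # accumulated intermediate-data volume per key
    for i, (key, size) in enumerate(records):
        totals[key] = totals.get(key, 0) + size
        if i < k:
            reservoir.append((key, size))
        else:
            # classic reservoir replacement: keep each record with prob k/(i+1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = (key, size)
    return reservoir, totals

def classify_skew(totals, factor=2.0):
    """Flag keys whose accumulated volume exceeds factor * mean as skewed
    (an assumed threshold rule standing in for the paper's skew model)."""
    mean = sum(totals.values()) / len(totals)
    return {key: size > factor * mean for key, size in totals.items()}
```

A skewed key identified this way would be routed through the fine-grained placement path, while non-skewed keys can use coarse-grained placement.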