Abstract

In big-data parallel computing frameworks, I/O throughput dominates performance, especially for data-intensive workloads. As a leading parallel computation engine, Spark caches Resilient Distributed Datasets (RDDs) on different nodes to speed up computation. However, Spark lacks a good strategy for selecting which RDDs should have their partitions cached in limited memory. This paper proposes a novel cache management strategy comprising a Selection Algorithm and a Cluster Global Cleanup (CSC) algorithm. The Selection Algorithm automatically caches RDD partitions in memory, ordered by how often each RDD is reused, to speed up data-intensive computations. When many new RDDs are chosen for caching but the limited memory has run out, the system falls back to the Least Recently Used (LRU) replacement algorithm; LRU thus acts both as a per-worker cleaner and as a trigger for CSC. Exploiting the "all or nothing" property of parallel computation, CSC reclaims wasted memory on other workers. Experimental results show that Spark with our Selection Algorithm speeds up data-intensive workloads and that CSC improves memory utilization.
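The two-level policy the abstract describes — admit RDDs by reuse count, then fall back to LRU eviction when memory is exhausted — can be sketched as a simplified, Spark-free simulation. The class name `RDDCache`, the reuse threshold, and the eviction log are illustrative assumptions, not the paper's actual implementation:

```python
from collections import OrderedDict

class RDDCache:
    """Simplified simulation (assumption, not the paper's code) of the
    described policy: only RDDs that are reused get cached, and when the
    cache is full the least recently used entry is evicted. In the full
    system, such an eviction would also trigger the cluster-wide CSC
    cleanup on other workers."""

    def __init__(self, capacity):
        self.capacity = capacity    # max number of cached RDDs (stand-in for memory)
        self.store = OrderedDict()  # rdd_id -> use_count, maintained in LRU order
        self.evictions = []         # evicted ids (CSC trigger points in the paper)

    def access(self, rdd_id, use_count):
        if rdd_id in self.store:
            self.store.move_to_end(rdd_id)  # refresh recency on a cache hit
            return
        if use_count < 2:
            return  # Selection step: single-use RDDs are not worth caching
        if len(self.store) >= self.capacity:
            # Memory exhausted: evict the least recently used RDD.
            victim, _ = self.store.popitem(last=False)
            self.evictions.append(victim)
        self.store[rdd_id] = use_count

cache = RDDCache(capacity=2)
for rdd_id, uses in [("a", 5), ("b", 3), ("a", 5), ("c", 4), ("d", 1)]:
    cache.access(rdd_id, uses)
# "a" and "c" remain cached, "b" was evicted by LRU, "d" was never admitted.
```

The key design point mirrored here is that selection (admission by reuse count) and replacement (LRU) are separate mechanisms, with eviction doubling as the signal for cluster-wide cleanup.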
