Abstract

In big-data parallel computing frameworks, I/O throughput dominates performance, especially for data-intensive workloads. As a leading parallel computation engine, Spark caches Resilient Distributed Datasets (RDDs) on different nodes to speed up computation. However, Spark lacks a good strategy for selecting which RDDs should have their partitions cached in limited memory. This paper proposes a novel cache management strategy comprising a Selection Algorithm and a Cluster Global Cleanup (CSC) algorithm. The Selection Algorithm automatically caches RDD partitions in memory, ordered by how often each RDD is reused, to speed up data-intensive computations. When many new RDDs are chosen for caching but the limited memory has run out, the system falls back to the Least Recently Used (LRU) replacement algorithm; LRU thus acts both as a per-worker cleaner and as a trigger for CSC. Exploiting the "all or nothing" property of parallel computation, CSC reclaims wasted memory on other workers. Experimental results show that Spark with our Selection Algorithm speeds up data-intensive workloads and that CSC improves memory utilization.
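The two-level policy the abstract describes — admit RDDs by reuse count, then fall back to LRU eviction when memory is exhausted — can be sketched as a simplified, Spark-free simulation. The class name `RDDCache`, the reuse threshold, and the eviction log are illustrative assumptions, not the paper's actual implementation:

```python
from collections import OrderedDict

class RDDCache:
    """Simplified simulation (assumption, not the paper's code) of the
    described policy: only RDDs that are reused get cached, and when the
    cache is full the least recently used entry is evicted. In the full
    system, such an eviction would also trigger the cluster-wide CSC
    cleanup on other workers."""

    def __init__(self, capacity):
        self.capacity = capacity    # max number of cached RDDs (stand-in for memory)
        self.store = OrderedDict()  # rdd_id -> use_count, maintained in LRU order
        self.evictions = []         # evicted ids (CSC trigger points in the paper)

    def access(self, rdd_id, use_count):
        if rdd_id in self.store:
            self.store.move_to_end(rdd_id)  # refresh recency on a cache hit
            return
        if use_count < 2:
            return  # Selection step: single-use RDDs are not worth caching
        if len(self.store) >= self.capacity:
            # Memory exhausted: evict the least recently used RDD.
            victim, _ = self.store.popitem(last=False)
            self.evictions.append(victim)
        self.store[rdd_id] = use_count

cache = RDDCache(capacity=2)
for rdd_id, uses in [("a", 5), ("b", 3), ("a", 5), ("c", 4), ("d", 1)]:
    cache.access(rdd_id, uses)
# "a" and "c" remain cached, "b" was evicted by LRU, "d" was never admitted.
```

The key design point mirrored here is that selection (admission by reuse count) and replacement (LRU) are separate mechanisms, with eviction doubling as the signal for cluster-wide cleanup.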
