Abstract

In the era of Big Data, processing large amounts of data through data-intensive applications presents a challenge. Apache Spark, an in-memory distributed computing system, is often used to speed up big data applications. It caches intermediate data in memory, so there is no need to repeat the computation or reload data from disk when that data is reused later. This mechanism of caching data in memory makes Apache Spark much faster than other systems. When the memory used for caching data is full, Apache Spark applies the Least Recently Used (LRU) cache replacement policy; however, the LRU algorithm performs poorly on some workloads. This review gives insight into the different replacement algorithms proposed to address LRU's problems, categorizes the factors they use to select victims for eviction, and compares the algorithms in terms of selection factors, performance, and the benchmarks used in the research.
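To make the eviction behavior under discussion concrete, the following is a minimal sketch of an LRU policy in Python. The keys (`rdd_a`, etc.) are purely illustrative; Spark's actual cache manager is written in Scala and tracks block sizes and storage levels, which this sketch omits:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.store:
            return None
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used entry

# With capacity 2, inserting a third key evicts the least recently used one.
cache = LRUCache(2)
cache.put("rdd_a", 1)
cache.put("rdd_b", 2)
cache.get("rdd_a")          # touch rdd_a, so rdd_b is now least recent
cache.put("rdd_c", 3)       # evicts rdd_b
print(cache.get("rdd_b"))   # None (evicted)
print(cache.get("rdd_a"))   # 1 (retained)
```

The weakness the review surveys is visible even here: recency alone decides eviction, so an entry that is expensive to recompute or about to be reused can still be dropped simply because it was not the most recently touched.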
