Improving Apache Spark's Cache Mechanism with LRC-Based Method Using Bloom Filter

Hideo Inagaki, Ryota Kawashima, Hiroshi Matsuo

doi:10.1109/candarw.2018.00096

Abstract

Memory-and-Disk caching is a common caching mechanism for temporary output in Apache Spark. However, it causes performance degradation once memory usage reaches its limit, because Spark's cache management is based on LRU (Least Recently Used). Existing studies have proposed replacing the LRU-based cache mechanism with an LRC (Least Reference Count) based one, since the reference count is a more accurate indicator of the likelihood of future data access. However, frequently used partitions still cannot be identified, because Spark accesses all partitions for user-driven RDD operations, even those that do not contain the necessary data. In this paper, we propose a cache management method that keeps the necessary partitions in memory by introducing a Bloom filter into the existing methods. The Bloom filter prevents unnecessary partitions from being processed, because each partition is first checked for whether it may contain the required data. Furthermore, frequently used partitions can be properly identified by measuring the reference count of each partition. We implemented two architecture types, a driver-side Bloom filter and an executor-side Bloom filter, to determine the best placement of the Bloom filter. Evaluation results showed that, in a filter-test benchmark, the driver-side implementation reduced the execution time by 89% compared with the LRC-based method.
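The interplay between the Bloom filter and the reference count can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use any Spark API: the names `BloomFilter`, `Partition`, `lookup`, and `eviction_candidate`, as well as the filter size and hash count, are hypothetical choices made for illustration. The sketch assumes each partition carries its own filter, and that a partition's reference count is incremented only when the filter reports that the partition may contain the requested key.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class Partition:
    """Hypothetical stand-in for a cached partition with its own filter."""

    def __init__(self, pid, records):
        self.pid = pid
        self.records = list(records)
        self.ref_count = 0
        self.bloom = BloomFilter()
        for record in self.records:
            self.bloom.add(record)


def lookup(partitions, key):
    """Scan only partitions whose filter may contain key, counting each scan."""
    hits = []
    for p in partitions:
        if not p.bloom.might_contain(key):
            continue  # filter says the key is definitely absent: skip
        p.ref_count += 1  # only genuinely consulted partitions are counted
        if key in p.records:
            hits.append(p.pid)
    return hits


def eviction_candidate(partitions):
    """LRC-style choice: the partition with the smallest reference count."""
    return min(partitions, key=lambda p: p.ref_count)


parts = [
    Partition(0, ["apple", "avocado"]),
    Partition(1, ["banana", "blueberry"]),
    Partition(2, ["cherry", "cranberry"]),
]
for _ in range(3):
    found = lookup(parts, "banana")
victim = eviction_candidate(parts)
```

Because partitions that cannot hold the key are skipped before their counters are touched, the reference counts separate the partitions that are actually needed from the rest. Without the filter, every lookup would increment every counter and the counts would carry no information for eviction.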

Full Text