Abstract

With the rapid growth of Internet data, the performance of big data processing platforms is attracting increasing attention. In Spark, cached data are evicted by the Least Recently Used (LRU) algorithm. LRU does not account for the cost of cached data, so it may evict important data. In addition, cache placement is random, with no mechanism for identifying efficient cache servers. To address these problems, a remote cache management framework (RCM) for the Spark platform is proposed, comprising a cache weight generation module (CWG), a cache replacement module (CREP), and a cache placement module (CPL). CWG establishes initial weights from three main factors: the response time of querying the database, the number of queries, and the data size; it then reduces the weight of stale data through a time-loss function. CREP maximizes the sum of cached-data weights using a greedy strategy. CPL assigns each data item to the most suitable cache server based on the Kuhn-Munkres matching algorithm to improve cooperation efficiency. To verify the effectiveness of RCM, it was implemented on Redis and deployed on eight computing nodes and four cache servers. Three groups of benchmark jobs, PageRank, K-means, and WordCount, were tested. The experimental results confirm that, compared with MCM, SACM, and DMAOM, RCM reduces execution time by up to 42.1%.
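As a rough illustration of the three modules described above, the following Python sketch shows one plausible realization of the weight generation, greedy replacement, and Kuhn-Munkres placement steps. The class and function names, the multiplicative weight formula, and the exponential time-loss term are all assumptions for illustration, not the authors' implementation; the Kuhn-Munkres step uses SciPy's linear_sum_assignment.

```python
# A minimal sketch of the three RCM modules, assuming a simple in-memory model.
# CacheItem, decayed_weight, greedy_replace, and km_placement are hypothetical
# names; the weight formula and decay form are assumptions, not the paper's.
import time
import numpy as np
from scipy.optimize import linear_sum_assignment  # Kuhn-Munkres (Hungarian)

class CacheItem:
    def __init__(self, key, response_time, query_count, size, created=None):
        self.key = key
        self.response_time = response_time  # time to recompute via the database
        self.query_count = query_count      # how often the data are queried
        self.size = size                    # bytes occupied in the cache
        self.created = created or time.time()

# CWG: initial weight from the three factors, aged by a time-loss term.
def decayed_weight(item, decay_rate=1e-4):
    initial = item.response_time * item.query_count / item.size
    age = time.time() - item.created
    return initial * np.exp(-decay_rate * age)  # assumed exponential decay

# CREP: greedily keep the highest-weight items that fit the capacity,
# so the sum of retained weights is maximized under the size constraint.
def greedy_replace(items, capacity):
    kept, used = [], 0
    for it in sorted(items, key=decayed_weight, reverse=True):
        if used + it.size <= capacity:
            kept.append(it)
            used += it.size
    return kept  # everything not kept is evicted

# CPL: assign each data item to a cache server via Kuhn-Munkres, given a
# benefit matrix benefit[i][j] for placing item i on server j (e.g. weight
# scaled by network proximity). linear_sum_assignment minimizes cost, so
# the benefits are negated.
def km_placement(benefit):
    rows, cols = linear_sum_assignment(-np.asarray(benefit))
    return dict(zip(rows, cols))  # item index -> server index
```

In this sketch the greedy pass over items sorted by decayed weight mirrors the abstract's goal of maximizing the total weight of retained cache data, and the assignment step gives a one-to-one item-to-server matching as the Kuhn-Munkres algorithm guarantees.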
