Importance of Memory Management Layer in Big Data Architecture

Maha Dessokey, Sherif M Saif, Hesham Eldeeb, Sameh Salem, Elsayed Saad

doi:10.14569/ijacsa.2022.0130554

Abstract

The daily generation of massive amounts of heterogeneous data from a variety of sources challenges storage and analysis capabilities and brings new problems to high-performance computing clusters. To make better use of this huge and heterogeneous data, the continuous development of advanced Big Data platforms and Big Data analytic techniques is required. One significant issue with in-memory Big Data processing platforms, such as Apache Spark, is that the user is responsible for deciding whether intermediate data should be cached. In addition, the data may be kept in several storage systems and physically scattered over different racks, regions, and clouds. Data needs to be close to the computation nodes, so data locality is a challenge. This paper evaluates the use of a distinct memory management layer between the data processing layer and the data storage layer, which automatically caches data without any intervention from application developers. The framework is evaluated with the K-means, PageRank, and WordCount workloads from the HiBench benchmark, together with a real-world case that predicts real estate prices using a Gradient Boosting Regression Tree model. Experiments show that the memory management layer outperforms Apache Spark in reducing execution time.

Full Text