Abstract
A lack of memory can cause job failures or increase processing time due to garbage collection. Conversely, if too much memory is provisioned, processing time is only marginally reduced and most of the memory is wasted. Many big data processing tasks are executed in cloud environments, where renting virtual resources incurs a cost that depends on the resource specifications (i.e., the number of virtual cores and the size of memory) as well as the rental time. In this paper, given the type of workload and the volume of the input data, we analyze the memory usage pattern and derive the efficient memory size of data-parallel workloads in Apache Spark. We then propose a machine-learning-based prediction model that determines the efficient memory for a given workload and dataset. To validate the proposed model, we applied it to data-parallel workloads, including a deep learning model. The predicted memory values were in close agreement with the actual amount of required memory. Moreover, building the proposed model requires at most 44% of the total execution time of a data-parallel workload, and the model can improve memory efficiency by up to 1.89 times compared with the vanilla Spark setting.
Highlights
Big data analysis applications [1] are executed on distributed parallel processing environments, where multiple worker nodes can perform tasks simultaneously by partitioning big data into multiple blocks
The data-parallel model is suitable for processing big data because multiple worker nodes can independently process data partitions allocated by a master node
Note that the runtime memory usage following garbage collection (GC) is measured for every GC that occurs while processing workload W, and the maximum value across all trials is taken as the maximum unrecoverable memory size
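The last highlight can be illustrated with a small sketch. The heap values below are invented for illustration; in practice the post-GC usage would be parsed from JVM GC logs of repeated runs of workload W.

```python
# Hypothetical post-GC heap samples (in MB) from three trials of a workload W.
# Each inner list holds the heap usage measured right after each GC event.
gc_samples_per_trial = [
    [512, 601, 588],  # trial 1
    [530, 615, 590],  # trial 2
    [525, 640, 602],  # trial 3
]

def max_unrecoverable_memory(trials):
    """Maximum post-GC heap usage observed across all GCs in all trials,
    i.e., the largest amount of memory GC could not reclaim at any point."""
    return max(max(samples) for samples in trials)

print(max_unrecoverable_memory(gc_samples_per_trial))  # -> 640
```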
Summary
Big data analysis applications [1] are executed on distributed parallel processing environments, where multiple worker nodes can perform tasks simultaneously by partitioning big data into multiple blocks. Spark provides a wide range of libraries for various types of workloads in distributed environments, such as machine learning [4], streaming data processing [5], and query processing [6], in addition to the existing MapReduce-based algorithms. Overall, it improves on many aspects of the first-generation platforms. This paper proposes a memory usage model for data-parallel workloads that considers the characteristics of data, workloads, and system environments in the general-purpose distributed-processing Spark platform. Based on the memory usage model, we propose a memory prediction model that uses machine learning techniques to estimate an efficient amount of memory for a given data-parallel workload in the Spark environment.
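As a minimal sketch of the prediction idea, the toy example below fits a least-squares line mapping input data volume to measured efficient memory for a fixed workload type, then predicts the memory for an unseen input size. The sample points and the choice of a simple linear model are illustrative assumptions, not the paper's actual model or data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# (input data size in GB, measured efficient memory in GB) -- invented points
data_gb   = [1.0, 2.0, 4.0, 8.0]
memory_gb = [1.5, 2.4, 4.1, 7.8]

slope, intercept = fit_line(data_gb, memory_gb)
predicted = slope * 6.0 + intercept  # predicted efficient memory for 6 GB input
print(round(predicted, 2))
```

In the paper's setting the feature set would also include the workload type and system environment, but the pipeline is the same: measure efficient memory for sample runs, fit a model, and predict for new inputs.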