Abstract

A lack of memory can lead to job failures or increased garbage-collection time. However, if too much memory is provided, processing time is only marginally reduced and most of the memory is wasted. Many big data processing tasks are executed in cloud environments. When renting virtual resources in a cloud environment, it is necessary to pay according to the resource specifications (i.e., the number of virtual cores and the size of memory) as well as the rental time. In this paper, given the type of workload and the volume of the input data, we analyze the memory usage pattern and derive the efficient memory size of data-parallel workloads in Apache Spark. We then propose a machine-learning-based prediction model that determines the efficient memory for a given workload and dataset. To validate the proposed model, we applied it to data-parallel workloads, including a deep learning model. The predicted memory values were in close agreement with the actual amount of required memory. Additionally, building the proposed model requires at most 44% of the total execution time of a data-parallel workload. The proposed model can improve memory efficiency by up to 1.89 times compared with the vanilla Spark setting.

Highlights

  • Big data analysis applications [1] are executed on distributed parallel processing environments, where multiple worker nodes can perform tasks simultaneously by partitioning big data into multiple blocks

  • The data-parallel model is suitable for processing big data because multiple worker nodes can independently process data partitions allocated by a master node

  • Note that the runtime memory usage following garbage collection (GC) is measured for all GCs that occur while processing W, and the maximum value of all trials is considered as the maximum unrecoverable memory size
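The highlight above describes taking the maximum post-GC heap occupancy over all collections as the maximum unrecoverable memory size. A minimal sketch of that measurement, assuming a hypothetical JVM GC log format in which each line records heap occupancy as `before->after(total)`:

```python
import re

# Hypothetical JVM GC log excerpt; the "NNNK->MMMK" pattern records
# heap occupancy before -> after each collection.
GC_LOG = """\
[GC (Allocation Failure) 524288K->131072K(1048576K), 0.012s]
[GC (Allocation Failure) 655360K->163840K(1048576K), 0.015s]
[Full GC (Ergonomics) 786432K->147456K(1048576K), 0.210s]
"""

def max_unrecoverable_kb(log: str) -> int:
    """Return the maximum post-GC heap size (KB) across all collections.

    Memory still occupied after a GC could not be reclaimed, so the
    maximum over all GCs approximates the unrecoverable memory size.
    """
    after_sizes = [int(m.group(1)) for m in re.finditer(r"->(\d+)K\(", log)]
    return max(after_sizes)

print(max_unrecoverable_kb(GC_LOG))  # -> 163840
```

The actual log format depends on the JVM version and logging flags; only the idea of taking the maximum of the post-GC occupancies follows the paper.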


Summary

Introduction

Big data analysis applications [1] are executed on distributed parallel processing environments, where multiple worker nodes can perform tasks simultaneously by partitioning big data into multiple blocks. Spark provides a wide range of libraries for various types of workload, such as machine learning [4], streaming data processing [5], and query processing [6] in distributed environments, in addition to the existing MapReduce-based algorithms; overall, it improves on many aspects of the first-generation platforms. This paper proposes a memory usage model for data-parallel workloads that considers the characteristics of data, workloads, and system environments on the general-purpose distributed-processing Spark platform. Based on the memory usage model, we propose a memory prediction model that estimates the efficient amount of memory for data-parallel workloads in the Spark environment using machine learning techniques.
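The prediction model described above maps workload and data characteristics to an efficient memory size. A minimal sketch of the idea, assuming hypothetical features (input size in GB and a numeric workload-type code) and toy training values; the paper's real features and targets come from runtime memory profiling:

```python
import numpy as np

# Toy training set (hypothetical): input data size in GB and a numeric
# workload-type code, mapped to an observed efficient memory size in GB.
X = np.array([
    [10, 0], [20, 0], [40, 0],   # e.g., wordcount-like workloads
    [10, 1], [20, 1], [40, 1],   # e.g., iterative ML workloads
], dtype=float)
y = np.array([2.0, 3.5, 7.0, 4.0, 7.5, 15.0])

# Fit a linear model  memory ~ w . features + b  by least squares.
A = np.hstack([X, np.ones((len(X), 1))])   # append bias column
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_memory_gb(size_gb: float, workload_code: int) -> float:
    """Predict the efficient memory size (GB) for a workload/data pair."""
    return float(np.array([size_gb, workload_code, 1.0]) @ w)

print(round(predict_memory_gb(30, 1), 2))
```

A linear fit is used here purely for illustration; the paper's model-building method and feature set are described in the later sections.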

Related Work
Data-Parallel Workloads
Memory Management Model of the Java Runtime Environment
Spark Memory Model
Memory Usage Pattern
Memory Undersupply
Memory Oversupply
Runtime Memory Profiling
Data-Parallel Characteristics in Spark
Data Characteristics
Maximum Unrecoverable Memory Estimation
Estimation Model
Model Building Methods
Experiment
Experiment Environment
Performance Metrics
Wordcount
K-Means Clustering
Logistic Regression and Neural Network
Workload Input Data Description
Prediction Accuracy
Prediction Cost
Memory Efficiency
Findings
