Abstract
A lack of memory can cause job failures or increase processing time due to garbage collection. Conversely, if too much memory is provisioned, processing time is only marginally reduced and most of the memory is wasted. Many big data processing tasks are executed in cloud environments, where renting virtual resources incurs a cost that depends on the resource specifications (i.e., the number of virtual cores and the size of memory) as well as the rental time. In this paper, given the type of workload and the volume of the input data, we analyze the memory usage pattern and derive the efficient memory size of data-parallel workloads in Apache Spark. We then propose a machine-learning-based prediction model that determines the efficient memory for a given workload and dataset. To validate the proposed model, we applied it to data-parallel workloads, including a deep learning model. The predicted memory values were in close agreement with the actual amount of required memory. Moreover, building the proposed model requires at most 44% of the total execution time of a data-parallel workload, and the model can improve memory efficiency by up to 1.89 times compared with the vanilla Spark setting.
Highlights
Big data analysis applications [1] are executed on distributed parallel processing environments, where multiple worker nodes can perform tasks simultaneously by partitioning big data into multiple blocks
The data-parallel model is suitable for processing big data because multiple worker nodes can independently process data partitions allocated by a master node
Note that the runtime memory usage following garbage collection (GC) is measured for every GC that occurs while processing workload W, and the maximum value across all trials is taken as the maximum unrecoverable memory size
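The last highlight can be illustrated with a small sketch. The heap values below are invented for illustration; in practice the post-GC usage would be parsed from JVM GC logs of repeated runs of workload W.

```python
# Hypothetical post-GC heap samples (in MB) from three trials of a workload W.
# Each inner list holds the heap usage measured right after each GC event.
gc_samples_per_trial = [
    [512, 601, 588],  # trial 1
    [530, 615, 590],  # trial 2
    [525, 640, 602],  # trial 3
]

def max_unrecoverable_memory(trials):
    """Maximum post-GC heap usage observed across all GCs in all trials,
    i.e., the largest amount of memory GC could not reclaim at any point."""
    return max(max(samples) for samples in trials)

print(max_unrecoverable_memory(gc_samples_per_trial))  # -> 640
```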
Summary
Big data analysis applications [1] are executed on distributed parallel processing environments, where multiple worker nodes can perform tasks simultaneously by partitioning big data into multiple blocks. Spark provides a wide range of libraries for various types of workloads in distributed environments, such as machine learning [4], streaming data processing [5], and query processing [6], in addition to the existing MapReduce-based algorithms. Overall, it improves on many aspects of the first-generation platforms. This paper proposes a memory usage model for data-parallel workloads that considers the characteristics of data, workloads, and system environments in the general-purpose distributed-processing Spark platform. Based on the memory usage model, we propose a memory prediction model that uses machine learning techniques to estimate an efficient amount of memory for a given data-parallel workload in the Spark environment.
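As a minimal sketch of the prediction idea, the toy example below fits a least-squares line mapping input data volume to measured efficient memory for a fixed workload type, then predicts the memory for an unseen input size. The sample points and the choice of a simple linear model are illustrative assumptions, not the paper's actual model or data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# (input data size in GB, measured efficient memory in GB) -- invented points
data_gb   = [1.0, 2.0, 4.0, 8.0]
memory_gb = [1.5, 2.4, 4.1, 7.8]

slope, intercept = fit_line(data_gb, memory_gb)
predicted = slope * 6.0 + intercept  # predicted efficient memory for 6 GB input
print(round(predicted, 2))
```

In the paper's setting the feature set would also include the workload type and system environment, but the pipeline is the same: measure efficient memory for sample runs, fit a model, and predict for new inputs.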