Abstract
Many data centers facilitate data processing and acquisition by developing multiple Apache Spark jobs that are executed in private clouds with varying parameters. Each job may take several application parameters that influence its execution time, for example a selected area of interest in a spatiotemporal data processing application or a time range of events in a complex event stream processing application. To predict a job's runtime accurately, these application parameters must be taken into account when constructing its runtime model. Accurate runtime prediction of Spark jobs allows them to be scheduled efficiently in order to utilize cloud resources, increase system throughput, reduce job latency, and meet customer requirements such as deadlines and QoS. Prediction is also an important advantage when using a pay-as-you-go pricing model. In this paper, we present a gray-box modeling methodology for predicting the runtime of each individual Apache Spark job in two steps. The first step builds a white-box model that predicts the input RDD size of each stage, relying on prior knowledge about the job's behaviour and taking the application parameters into consideration. The second step extracts a black-box runtime model of each task by observing its runtime metrics under varying allocated resources and input RDD sizes. The methodology is validated experimentally on a real-world application, and the results show a high prediction accuracy, matching 83-94% of the actual runtime of the tested application.
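A minimal sketch of how such a two-step gray-box predictor could be composed is shown below. The selectivity formula, the training measurements, and the choice of regressor are illustrative assumptions for this sketch, not the implementation evaluated in the paper.

```python
# Hypothetical sketch of a gray-box runtime predictor.
# White-box part: estimate a stage's input RDD size from an application parameter.
# Black-box part: regress task runtime on input RDD size and allocated resources.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def estimate_stage_input_size(total_records, area_of_interest, total_area):
    """White-box step: predict the stage's input RDD cardinality from a
    spatial area-of-interest parameter, assuming records are uniformly
    distributed over the total area (an assumption of this sketch)."""
    selectivity = area_of_interest / total_area
    return int(total_records * selectivity)


# Black-box step: fit a regression model on observed runs.
# Each sample: (input RDD size, executor cores, executor memory in GB) -> runtime in seconds.
X_train = np.array([
    [1_000_000, 2, 4], [1_000_000, 4, 8],
    [5_000_000, 2, 4], [5_000_000, 4, 8],
    [10_000_000, 4, 8], [10_000_000, 8, 16],
])
y_train = np.array([120.0, 70.0, 540.0, 290.0, 560.0, 310.0])  # example measurements

runtime_model = GradientBoostingRegressor().fit(X_train, y_train)

# Gray-box prediction for a new job configuration:
rdd_size = estimate_stage_input_size(total_records=20_000_000,
                                     area_of_interest=50.0, total_area=400.0)
predicted_runtime = runtime_model.predict([[rdd_size, 4, 8]])[0]
print(f"estimated input size: {rdd_size} records, "
      f"predicted runtime: {predicted_runtime:.0f} s")
```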
Highlights
Big data platforms such as Hadoop, Spark, or Flink are mainly used to process and analyze huge volumes of data, resulting in runtimes of minutes or even hours
To the best of our knowledge, all previous modeling methodologies for runtime prediction of Spark jobs use only the data size, Spark configuration, and allocated resources, ignoring the application parameters. This results in unacceptable variance between the predicted and the actual runtimes. To address this problem and improve the prediction accuracy, we present in this study a gray-box runtime modeling methodology for Apache Spark jobs that incorporates the application parameters
We present a gray-box modeling methodology for runtime prediction of Apache Spark jobs and apply it to the reuse of intermediate results of such jobs, for which a cost-based decision model is used
Summary
Big data platforms such as Hadoop, Spark, or Flink are mainly used to process and analyze huge volumes of data, resulting in runtimes of minutes or even hours. Different data scientists often work with the same input data and all have to apply the same preprocessing steps (data cleaning, transformation, etc.), so their jobs share common sub-tasks that repeatedly consume computing resources even though they produce the same result. Sharing this work (e.g., by materializing intermediate results) is not done automatically and transparently. To the best of our knowledge, all previous modeling methodologies for runtime prediction of Spark jobs use only the data size, Spark configuration, and allocated resources, ignoring the application parameters. This results in unacceptable variance between the predicted and the actual runtimes. To address this, we present a gray-box modeling methodology that consists of the following two steps:
1. White-box modeling: We study the influence of each application parameter on the RDD cardinality while varying the application parameter values and estimate the RDD cardinality for each operator.
2. Black-box modeling: We extract a runtime model of each task by observing its runtime metrics under varying allocated resources and input RDD sizes.
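As an illustrative sketch of the white-box step (the operators, selectivity formulas, and parameter names below are assumptions of this sketch, not the authors' model), each operator's output cardinality can be expressed as a function of its input cardinality and the relevant application parameter, and the estimates can be propagated along the operator chain to obtain the input RDD size of every stage:

```python
# Hypothetical per-operator cardinality estimation for a spatiotemporal job.
# Chaining the per-operator estimates yields the input RDD size of each stage.

def filter_by_time_range(card_in, params):
    # Assume events are uniformly spread over the observed period.
    return card_in * (params["time_range_hours"] / params["total_hours"])

def filter_by_area(card_in, params):
    # Assume records are uniformly distributed over the total area.
    return card_in * (params["area_of_interest"] / params["total_area"])

def flat_map_expand(card_in, params):
    # Assume a measured average expansion factor per input record.
    return card_in * params["avg_points_per_record"]

# Operator chain of the hypothetical job, in stage order.
pipeline = [filter_by_time_range, filter_by_area, flat_map_expand]

def estimate_stage_cardinalities(input_records, params):
    """Propagate the input cardinality through the operator chain and
    return the estimated input RDD size of each stage."""
    cards = [input_records]
    for op in pipeline:
        cards.append(op(cards[-1], params))
    return cards

params = {
    "time_range_hours": 6, "total_hours": 24,
    "area_of_interest": 50.0, "total_area": 400.0,
    "avg_points_per_record": 3.2,
}
print(estimate_stage_cardinalities(20_000_000, params))
# -> [20000000, 5000000.0, 625000.0, 2000000.0]
```

These per-stage estimates would then feed the black-box runtime model of the second step.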