Abstract

Apache Spark is a powerful open-source data processing platform whose popularity is growing with the need to process massive amounts of data. A performance prediction model not only helps administrators better understand system behavior, but is also useful for performance tuning. However, given the complex application processing mechanism of Spark, modeling the relationship between system performance and configuration settings is not an easy job. In this paper, we present a gray-box performance model for Spark applications based on machine learning algorithms. Given a specific Spark application, the size of its input data, and some key system parameters, this model forecasts the application's execution time from historical information. To achieve better accuracy, our model takes basic hardware information and Spark's resource allocation strategy into consideration. In our experiments, the results show that our gray-box model outperforms typical black-box approaches in most cases. We believe this model will be helpful for further research on Apache Spark.
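To make the prediction task concrete, the following is a minimal sketch of the kind of black-box baseline the abstract contrasts against: a linear least-squares model fit on historical runs, mapping a few assumed features (input size, executor cores, executor memory) to execution time. The feature set, data, and function names are illustrative, not the paper's actual gray-box method.

```python
import numpy as np

def fit_runtime_model(features, runtimes):
    """Fit a linear least-squares model mapping run features to runtime (s)."""
    X = np.column_stack([features, np.ones(len(features))])  # append bias term
    coef, *_ = np.linalg.lstsq(X, runtimes, rcond=None)
    return coef

def predict_runtime(coef, feature_row):
    """Predict runtime for one run described by [input_gb, cores, mem_gb]."""
    return float(np.append(feature_row, 1.0) @ coef)

# Synthetic "history" of past runs: [input_gb, cores, mem_gb] -> seconds.
# These numbers are made up purely to exercise the fitting code.
history_X = np.array([
    [10, 4, 8], [20, 4, 8], [40, 8, 16], [80, 8, 16], [160, 16, 32],
], dtype=float)
history_y = np.array([120.0, 230.0, 250.0, 470.0, 500.0])

coef = fit_runtime_model(history_X, history_y)
est = predict_runtime(coef, np.array([40.0, 8.0, 16.0]))
```

A gray-box model in the paper's sense would augment such regression features with knowledge of Spark's internals (e.g., its resource allocation strategy) rather than treating the system as an opaque function.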
