Abstract
Many data centers facilitate data processing and acquisition by developing multiple Apache Spark jobs that are executed in private clouds with varying parameters. Each job may take several application parameters that influence its execution time, for example a selected area of interest in a spatiotemporal data processing application or a time range of events in a complex event stream processing application. To predict a job's runtime accurately, these application parameters must be taken into account when constructing its runtime model. Accurate runtime prediction of Spark jobs allows them to be scheduled efficiently in order to utilize cloud resources, increase system throughput, reduce job latency, and meet customer requirements such as deadlines and QoS. Prediction is also an important advantage when using a pay-as-you-go pricing model. In this paper, we present a gray-box modeling methodology for predicting the runtime of each individual Apache Spark job in two steps. The first step builds a white-box model that predicts the input RDD size of each stage, relying on prior knowledge about the job's behaviour and taking the application parameters into consideration. The second step extracts a black-box runtime model of each task by observing its runtime metrics under varying allocated resources and input RDD sizes. The methodology is validated experimentally on a real-world application, and the results show a high prediction accuracy, matching 83-94% of the actual runtime of the tested application.
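A minimal sketch of how such a two-step gray-box predictor could be composed is shown below. The selectivity formula, the training measurements, and the choice of regressor are illustrative assumptions for this sketch, not the implementation evaluated in the paper.

```python
# Hypothetical sketch of a gray-box runtime predictor.
# White-box part: estimate a stage's input RDD size from an application parameter.
# Black-box part: regress task runtime on input RDD size and allocated resources.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def estimate_stage_input_size(total_records, area_of_interest, total_area):
    """White-box step: predict the stage's input RDD cardinality from a
    spatial area-of-interest parameter, assuming records are uniformly
    distributed over the total area (an assumption of this sketch)."""
    selectivity = area_of_interest / total_area
    return int(total_records * selectivity)


# Black-box step: fit a regression model on observed runs.
# Each sample: (input RDD size, executor cores, executor memory in GB) -> runtime in seconds.
X_train = np.array([
    [1_000_000, 2, 4], [1_000_000, 4, 8],
    [5_000_000, 2, 4], [5_000_000, 4, 8],
    [10_000_000, 4, 8], [10_000_000, 8, 16],
])
y_train = np.array([120.0, 70.0, 540.0, 290.0, 560.0, 310.0])  # example measurements

runtime_model = GradientBoostingRegressor().fit(X_train, y_train)

# Gray-box prediction for a new job configuration:
rdd_size = estimate_stage_input_size(total_records=20_000_000,
                                     area_of_interest=50.0, total_area=400.0)
predicted_runtime = runtime_model.predict([[rdd_size, 4, 8]])[0]
print(f"estimated input size: {rdd_size} records, "
      f"predicted runtime: {predicted_runtime:.0f} s")
```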
Highlights
Big data platforms such as Hadoop, Spark, or Flink are mainly used to process and analyze huge volumes of data, resulting in runtimes of minutes or even hours
To the best of our knowledge, all previous modeling methodologies for runtime prediction of Spark jobs use only the data size, Spark configuration, and allocated resources, ignoring the application parameters. This results in unacceptable variance between the predicted and the actual runtimes. To address this problem and improve the prediction accuracy, we present in this study a gray-box runtime modeling methodology for Apache Spark jobs that incorporates the application parameters
We present a gray-box modeling methodology for runtime prediction of Apache Spark jobs and apply it to the reuse of intermediate results of such jobs, for which a cost-based decision model is used
Summary
Big data platforms such as Hadoop, Spark, or Flink are mainly used to process and analyze huge volumes of data, resulting in runtimes of minutes or even hours. Different data scientists often work with the same input data and all have to apply the same preprocessing steps (data cleaning, transformation, etc.), so their jobs share common sub-tasks that repeatedly consume computing resources even though they produce the same result. Sharing this work (e.g., by materializing intermediate results) is not done automatically and transparently. To the best of our knowledge, all previous modeling methodologies for runtime prediction of Spark jobs use only the data size, Spark configuration, and allocated resources, ignoring the application parameters. This results in unacceptable variance between the predicted and the actual runtimes. To address this, we present a gray-box modeling methodology that consists of the following two steps:
1. White-box modeling: We study the influence of each application parameter on the RDD cardinality while varying the application parameter values and estimate the RDD cardinality for each operator.
2. Black-box modeling: We extract a runtime model of each task by observing its runtime metrics under varying allocated resources and input RDD sizes.
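As an illustrative sketch of the white-box step (the operators, selectivity formulas, and parameter names below are assumptions of this sketch, not the authors' model), each operator's output cardinality can be expressed as a function of its input cardinality and the relevant application parameter, and the estimates can be propagated along the operator chain to obtain the input RDD size of every stage:

```python
# Hypothetical per-operator cardinality estimation for a spatiotemporal job.
# Chaining the per-operator estimates yields the input RDD size of each stage.

def filter_by_time_range(card_in, params):
    # Assume events are uniformly spread over the observed period.
    return card_in * (params["time_range_hours"] / params["total_hours"])

def filter_by_area(card_in, params):
    # Assume records are uniformly distributed over the total area.
    return card_in * (params["area_of_interest"] / params["total_area"])

def flat_map_expand(card_in, params):
    # Assume a measured average expansion factor per input record.
    return card_in * params["avg_points_per_record"]

# Operator chain of the hypothetical job, in stage order.
pipeline = [filter_by_time_range, filter_by_area, flat_map_expand]

def estimate_stage_cardinalities(input_records, params):
    """Propagate the input cardinality through the operator chain and
    return the estimated input RDD size of each stage."""
    cards = [input_records]
    for op in pipeline:
        cards.append(op(cards[-1], params))
    return cards

params = {
    "time_range_hours": 6, "total_hours": 24,
    "area_of_interest": 50.0, "total_area": 400.0,
    "avg_points_per_record": 3.2,
}
print(estimate_stage_cardinalities(20_000_000, params))
# -> [20000000, 5000000.0, 625000.0, 2000000.0]
```

These per-stage estimates would then feed the black-box runtime model of the second step.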