Abstract

In Spark computing environments, the execution time of a stage can be extended by a few slow-running tasks. To tackle this so-called straggler problem, Spark adopts a speculative execution mechanism under which the scheduler speculatively launches a backup copy of a straggler task in the hope that it completes earlier. However, due to the characteristics of tasks and the complexity of runtime environments, Spark's original speculative execution strategy and its improved versions cannot deal with this problem effectively. In this paper, we propose a novel strategy called ETWR to improve the efficiency of speculative execution in Spark. We take the heterogeneity of the environment into account when tackling the three key points of speculative execution: straggler identification, backup node selection, and effectiveness guarantee. First, based on task type classification, we divide each task into sub-phases and use both the processing speed and the progress rate within a phase to identify stragglers promptly. Second, we use a Locally Weighted Regression model to estimate the execution time of a task, from which the task's remaining time and backup time are calculated. Third, we present the iMCP model to guarantee the effectiveness of speculative tasks, which additionally keeps the load balanced across nodes. Finally, the factors of fast nodes and better data locality are considered when choosing suitable backup nodes. Extensive experiments show that ETWR can reduce job execution time by 23.8 percent and improve cluster throughput by 33.2 percent compared with Spark-2.2.0.
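To make the estimation step concrete, the following is a minimal sketch of how a Locally Weighted Regression model can predict a task's total execution time from historical samples and derive its remaining time. All names, the feature choice (task input size), and the Gaussian weighting kernel are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def lwr_predict(x_train, y_train, x_query, tau=1.0):
    """Locally Weighted Regression: fit a linear model whose samples are
    weighted by their distance to x_query, then predict at x_query."""
    X = np.column_stack([np.ones_like(x_train), x_train])      # add intercept
    w = np.exp(-((x_train - x_query) ** 2) / (2 * tau ** 2))   # Gaussian kernel
    W = np.diag(w)
    # Weighted least squares: theta = (X^T W X)^-1 X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
    return theta[0] + theta[1] * x_query

def estimate_remaining_time(input_size, elapsed, hist_sizes, hist_times, tau=5.0):
    """Estimate a running task's total time from historical
    (input size, execution time) samples, then derive its remaining time."""
    est_total = lwr_predict(hist_sizes, hist_times, input_size, tau)
    return max(est_total - elapsed, 0.0)
```

A scheduler could compare this remaining-time estimate with the estimated backup time on a candidate node: launching a speculative copy pays off only when the backup is expected to finish before the original.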
