Abstract

Internet of Things (IoT)-enabled applications use sensors and actuators to collect big data, which are processed by big data frameworks such as Spark. Data processing tasks are generally precedence constrained, and the computation results are transmitted to other IoT devices. In this article, we consider the Spark workflow problem of scheduling tasks with data affinity to heterogeneous servers to minimize the maximum completion time (makespan). In a Spark instance, jobs are precedence constrained, and the stages within each job are also precedence constrained, so the number of feasible topological stage orders is large. It is difficult to balance task execution times, which are determined by the heterogeneous servers, against the transmission times caused by data affinity. We propose a scheduling optimization algorithm framework consisting of five components: 1) temporal parameter calculation; 2) ready stage adding; 3) task sequencing; 4) resource allocation; and 5) schedule improvement. Strategies for each component are developed, and the algorithmic components are statistically calibrated over a comprehensive set of instances. The proposed algorithm is compared against two classic algorithms for similar problems, modified for our setting, on typical scientific workflow instances. The experimental results demonstrate the effectiveness of the proposed algorithm for the considered problem.
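The core trade-off described above, heterogeneous execution times versus data-affinity transfer costs on a precedence-constrained DAG, can be illustrated with a minimal earliest-finish-time list-scheduling sketch. This is a generic HEFT-style heuristic, not the paper's calibrated five-component algorithm, and all task, server, and transfer numbers below are invented for illustration:

```python
# Illustrative earliest-finish-time list scheduling on a stage DAG.
# NOT the paper's algorithm: a generic greedy sketch with made-up data.
from collections import deque

def topological_order(preds, n):
    """Return one topological order of tasks 0..n-1 from predecessor lists."""
    indeg = [len(preds[t]) for t in range(n)]
    succs = [[] for _ in range(n)]
    for t in range(n):
        for p in preds[t]:
            succs[p].append(t)
    q = deque(t for t in range(n) if indeg[t] == 0)
    order = []
    while q:
        t = q.popleft()
        order.append(t)
        for s in succs[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                q.append(s)
    return order

def schedule(exec_time, preds, transfer):
    """Greedily place each task (in topological order) on the server that
    minimizes its finish time. exec_time[t][s] is the runtime of task t on
    server s; transfer[(p, t)] is the transmission delay paid only when t
    and its predecessor p run on different servers (data affinity)."""
    n, m = len(exec_time), len(exec_time[0])
    server_free = [0.0] * m   # earliest idle time of each server
    finish = [0.0] * n        # finish time of each task
    placed = [0] * n          # chosen server of each task
    for t in topological_order(preds, n):
        best = None
        for s in range(m):
            # task may start once server s is idle and all inputs arrived
            ready = server_free[s]
            for p in preds[t]:
                cost = transfer[(p, t)] if placed[p] != s else 0.0
                ready = max(ready, finish[p] + cost)
            f = ready + exec_time[t][s]
            if best is None or f < best[0]:
                best = (f, s)
        finish[t], placed[t] = best
        server_free[best[1]] = best[0]
    return max(finish), placed

# Toy instance: diamond DAG 0 -> {1, 2} -> 3 on two heterogeneous servers.
exec_time = [[3, 2], [2, 4], [4, 3], [2, 2]]   # exec_time[task][server]
preds = [[], [0], [0], [1, 2]]
transfer = {(0, 1): 1, (0, 2): 1, (1, 3): 2, (2, 3): 2}
makespan, placement = schedule(exec_time, preds, transfer)
```

Because the greedy rule weighs each server's idle time, the task's server-dependent runtime, and the cross-server transfer delays together, tasks cluster on the server holding their input data unless a faster server outweighs the transfer cost, which is exactly the balance the proposed framework optimizes more systematically.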
