Abstract

In Spark's distributed computing framework, cross-node and cross-rack data transfer generated by map tasks and reduce tasks is a common cause of performance degradation: it prolongs overall execution time and congests the network. To address this problem, this article uses bipartite graph modelling to propose an optimal locality-aware task scheduling algorithm. By considering global optimality, the algorithm generates the optimal scheduling solution, with respect to data locality, for both the map tasks and the reduce tasks. Because the two kinds of task have different communication modes, this article uses a unified graph model to represent map task scheduling and reduce task scheduling, respectively. Then, by calculating the communication cost matrix of the tasks, we formulate optimal task scheduling as minimizing the overall communication cost and transform it into a well-known graph problem, minimum weighted bipartite matching (MWBM), which can be solved by the Kuhn-Munkres algorithm. In addition, this article proposes a locality-aware executor allocation strategy to further improve data locality. We implement the algorithm and the strategy in Spark 2.4.1 and evaluate their performance using several representative micro-benchmarks, macro-benchmarks, and the HiBench benchmark suite. The experimental results show that, by reducing network traffic and access latency, the proposed algorithm substantially improves job performance compared with some other task scheduling algorithms.
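
To make the MWBM formulation concrete, the following is a minimal sketch of minimum-cost bipartite assignment using the potential-based form of the Kuhn-Munkres (Hungarian) algorithm. It is not the article's implementation: the function name `min_cost_assignment`, the 3x3 matrix, and the cost values (0 for node-local, 1 for rack-local, 2 for cross-rack) are assumptions chosen only for illustration, and the abstract does not state how the actual communication cost matrix is computed.

```python
def min_cost_assignment(cost):
    """Match each row (task) to a distinct column (executor slot) at minimum total cost.

    cost is an n x m matrix with n <= m. Returns (assignment, total), where
    assignment[i] is the column chosen for row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)      # row potentials (1-indexed)
    v = [0] * (m + 1)      # column potentials (1-indexed)
    p = [0] * (m + 1)      # p[j] = row currently matched to column j, 0 if free
    way = [0] * (m + 1)    # previous column on the augmenting path

    for i in range(1, n + 1):
        p[0] = i           # column 0 is a virtual start holding the new row
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    # Reduced cost of edge (i0, j) under the current potentials.
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so at least one new edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column: augmenting path found
                break
        # Flip the matching along the augmenting path.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1

    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total


# Hypothetical communication costs: rows are tasks, columns are executor slots.
cost = [
    [0, 2, 2],
    [0, 1, 2],
    [2, 0, 1],
]
assignment, total = min_cost_assignment(cost)
print(assignment, total)
```

In this toy matrix, tasks 0 and 1 both prefer slot 0, so a greedy per-task choice cannot give every task its cheapest slot; the matching resolves the conflict globally, which is the sense of "global optimality" used above. Several assignments tie at the minimum total cost here, so only the total is determined by the input.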
