Abstract

Deep neural network (DNN) models are integrated into many real-world software applications. Due to their huge model sizes and complex computation, distributed deep learning (DDL) frameworks rely on a high-quality cluster scheduler to manage DDL training jobs in terms of both resource allocation and job scheduling. However, existing schedulers either allocate a fixed amount of resources or lack control over task placement, which leads to less efficient training. In this paper, we propose DeepSys, a GPU cluster scheduler tailored for DDL jobs. For a single model, DeepSys builds a speed model to accurately predict training speed and a memory model to achieve high resource utilization. For job scheduling, DeepSys jointly considers resource allocation and task placement to schedule jobs efficiently across the cluster. Experiments on Kubernetes in two clusters show that DeepSys outperforms the compared methods by 20%-25% on average job completion time and 10%-15% on makespan, respectively.
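
The abstract does not specify the functional form of the speed model. As a purely illustrative sketch, a per-iteration time model in the style of prior DDL schedulers can be fit from a handful of profiled runs and then queried to predict training speed for candidate resource allocations; the decomposition into compute and communication terms, the parameter names, and the fitting routine below are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def iter_time(params, num_workers, batch_size):
    # Hypothetical speed model: per-iteration time decomposed into a
    # compute term (scales with the per-worker batch) and a
    # communication term (grows with worker count). This is an assumed
    # form, not necessarily DeepSys's actual model.
    t_compute, t_comm = params
    return t_compute * batch_size / num_workers + t_comm * num_workers

# Fit the two coefficients from profiled (workers, batch, seconds/iter)
# samples via linear least squares: time = t_compute*(b/w) + t_comm*w.
profiled = [(1, 32, 0.40), (2, 32, 0.24), (4, 32, 0.18)]  # toy data
A = np.array([[b / w, w] for w, b, _ in profiled])
y = np.array([t for _, _, t in profiled])
params, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predicted training speed (iterations/sec) for a candidate allocation;
# a scheduler can compare such predictions across allocation choices.
print(1.0 / iter_time(params, num_workers=8, batch_size=32))
```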
