Abstract

Deep neural network (DNN) models are integrated into many real-world software applications. Due to their huge model sizes and complex computation, distributed deep learning (DDL) frameworks rely on a high-quality cluster scheduler to manage DDL training jobs in terms of both resource allocation and job scheduling. However, existing schedulers either allocate a fixed amount of resources or lack control over task placement, which leads to less efficient training. In this paper, we propose DeepSys, a GPU cluster scheduler tailored for DDL jobs. For a single model, DeepSys builds a speed model to accurately predict training speed and a memory model to achieve high resource utilization. For job scheduling, DeepSys jointly considers resource allocation and task placement to schedule jobs efficiently across the cluster. Experiments on Kubernetes in two clusters show that DeepSys outperforms the compared methods by 20%-25% on average job completion time and 10%-15% on makespan, respectively.
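
The abstract does not specify the functional form of the speed model. As a purely illustrative sketch, a per-iteration time model in the style of prior DDL schedulers can be fit from a handful of profiled runs and then queried to predict training speed for candidate resource allocations; the decomposition into compute and communication terms, the parameter names, and the fitting routine below are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def iter_time(params, num_workers, batch_size):
    # Hypothetical speed model: per-iteration time decomposed into a
    # compute term (scales with the per-worker batch) and a
    # communication term (grows with worker count). This is an assumed
    # form, not necessarily DeepSys's actual model.
    t_compute, t_comm = params
    return t_compute * batch_size / num_workers + t_comm * num_workers

# Fit the two coefficients from profiled (workers, batch, seconds/iter)
# samples via linear least squares: time = t_compute*(b/w) + t_comm*w.
profiled = [(1, 32, 0.40), (2, 32, 0.24), (4, 32, 0.18)]  # toy data
A = np.array([[b / w, w] for w, b, _ in profiled])
y = np.array([t for _, _, t in profiled])
params, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predicted training speed (iterations/sec) for a candidate allocation;
# a scheduler can compare such predictions across allocation choices.
print(1.0 / iter_time(params, num_workers=8, batch_size=32))
```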
