Abstract

Resource management of deep learning training (DLT) jobs is critical for cluster resource efficiency and client QoS assurance. Most existing scheduling frameworks require clients to specify job resource configurations, which can lead to over-provisioning or under-provisioning. Additionally, the performance of some static scheduling frameworks degrades in highly dynamic clusters. In this paper, we propose Qore-DL, a QoS-aware joint resource optimization framework for distributed DLT jobs. We divide the lifecycle of a DLT job into submission, queuing, and running stages. Qore-DL automatically configures appropriate resources for submitted jobs and greedily assigns scheduled jobs to hosts. For running jobs, Qore-DL employs a heuristic scheme to adjust their resources. In this way, Qore-DL jointly optimizes QoS satisfaction and resource efficiency across all three stages of a DLT job. We implemented a prototype of Qore-DL in TensorFlow on top of Kubernetes and conducted extensive experiments in CPU and GPU clusters to evaluate its performance. The experimental results show that, compared with its counterparts, Qore-DL improves the job completion rate by up to 42.4% and cluster resource efficiency by up to 21.8%.
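The abstract mentions that Qore-DL greedily assigns scheduled jobs to hosts but does not specify the heuristic. Purely as an illustration of what such a greedy placement step can look like, the sketch below places each job on the feasible host with the least leftover capacity (a best-fit rule). All names, the resource model, and the best-fit criterion are assumptions for illustration, not the paper's actual algorithm.

```python
# Hypothetical sketch of a greedy best-fit placement step.
# The best-fit rule and all names below are illustrative assumptions;
# Qore-DL's actual assignment heuristic may differ.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_cpu: float   # remaining CPU cores
    free_mem: float   # remaining memory (GiB)

@dataclass
class Job:
    name: str
    cpu: float        # requested CPU cores
    mem: float        # requested memory (GiB)

def greedy_assign(jobs: list[Job], hosts: list[Host]) -> dict[str, str]:
    """Map each job to the feasible host with the smallest residual capacity."""
    placement: dict[str, str] = {}
    # Place larger jobs first so they are not starved by small ones.
    for job in sorted(jobs, key=lambda j: (j.cpu, j.mem), reverse=True):
        feasible = [h for h in hosts
                    if h.free_cpu >= job.cpu and h.free_mem >= job.mem]
        if not feasible:
            continue  # job stays queued; no host can fit it right now
        # Best-fit: host with the least capacity left over after placement.
        best = min(feasible,
                   key=lambda h: (h.free_cpu - job.cpu) + (h.free_mem - job.mem))
        best.free_cpu -= job.cpu
        best.free_mem -= job.mem
        placement[job.name] = best.name
    return placement
```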
