Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems

Gingfung Yeung, Peter Garraghan, Adrian Friday, Richard Harper, Damian Borowiec, Renyu Yang

doi:10.1109/tpds.2021.3079202

Abstract

To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, this can incur interference, causing slowdown. In this article, we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts GPU utilization of heterogeneous DL jobs extrapolated from the DL model's computation graph features, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations across heterogeneous GPU hardware, we identify GPU utilization as a general proxy metric to determine good placement decisions, in contrast to current approaches, which reserve isolated GPUs to perform online profiling and directly measure GPU utilization for each unique submitted job. Our approach promotes high resource utilization and makespan reduction; via real-world experimentation and large-scale trace-driven simulation, we demonstrate that Horus outperforms other DL resource managers by up to 61.5 percent for GPU resource utilization, 23.7-30.7 percent for makespan reduction and 68.3 percent in job wait time reduction.
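
The idea of using predicted GPU utilization as a proxy for co-location decisions can be illustrated with a minimal sketch. This is not the Horus algorithm: the greedy policy, the `Gpu` class, the `place` function, and the 100 percent utilization cap are all hypothetical choices made for illustration, and the per-job utilization values are assumed to come from some upstream predictor.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Gpu:
    """Hypothetical GPU record tracking co-located jobs."""
    name: str
    jobs: List[str] = field(default_factory=list)
    predicted_util: float = 0.0  # sum of predicted utilization (0-100) of resident jobs


def place(job_name: str, job_util: float, gpus: List[Gpu],
          cap: float = 100.0) -> Optional[Gpu]:
    """Greedy placement (an assumed policy, not the paper's):
    pick the least-loaded GPU whose combined predicted utilization
    would stay at or below `cap`; return None if no GPU fits."""
    candidates = [g for g in gpus if g.predicted_util + job_util <= cap]
    if not candidates:
        return None  # the job would wait in the queue
    best = min(candidates, key=lambda g: g.predicted_util)
    best.jobs.append(job_name)
    best.predicted_util += job_util
    return best


# Usage: four jobs with assumed predicted utilizations, two GPUs.
gpus = [Gpu("gpu0"), Gpu("gpu1")]
first = place("job-a", 60.0, gpus)
second = place("job-b", 50.0, gpus)
third = place("job-c", 45.0, gpus)
fourth = place("job-d", 50.0, gpus)
```

In this sketch, a job that fits on no GPU is simply left unplaced; a real scheduler would queue it and retry as running jobs finish.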

Journal: IEEE Transactions on Parallel and Distributed Systems
Publication Date: Jan 1, 2022
Citations: 38

Similar Papers

Horus: An Interference-Aware Resource Manager for Deep Learning Systems
Gingfung Yeung ... Richard Harper
01 Jan 2020

Scheduling CPU for GPU-based Deep Learning Jobs
Wencong Xiao ... Quanlu Zhang
11 Oct 2018

Abstract 184: The utility of deep metric learning for breast cancer identification on mammographic images
Justin Du ... Sanjay Aneja
Cancer Research | VOL. 81
01 Jul 2021

Continuous Training and Deployment of Deep Learning Models
Ioannis Prapas ... Alireza Rezaei Mahdiraji
Datenbank-Spektrum | VOL. 21
01 Nov 2021
