Horus: An Interference-Aware Resource Manager for Deep Learning Systems

Gingfung Yeung, Renyu Yang, Damian Borowiec, Adrian Friday, Peter Garraghan, Richard Harper

doi:10.1007/978-3-030-60239-0_33

Abstract

Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems, ranging from a single GPU device to machine clusters, require state-of-the-art resource management to increase resource utilization and job throughput. While co-location (multiple jobs placed within the same GPU) has been identified as an effective means to achieve this, it incurs performance interference that directly degrades DL training and inference performance. Existing approaches to mitigating interference require resource-intensive and time-consuming kernel profiling that is ill-suited to runtime scheduling decisions, and current DL system resource managers are not designed to deal with these problems. This paper proposes Horus, an interference-aware resource manager for DL systems. Instead of relying on expensive kernel profiling, our approach estimates job resource utilization and co-location patterns to determine effective DL job placement that minimizes the likelihood of interference and improves system resource utilization and makespan. Our analysis shows that interference can cause up to 3.2x DL job slowdown. We integrated our approach within the Kubernetes resource manager and conducted experiments in a DL cluster by training 2,500 DL jobs using 13 different model types. Results demonstrate that Horus is able to outperform other DL resource managers by up to 61.5% for resource utilization and 33.6% for makespan.
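
The placement idea described in the abstract can be illustrated with a minimal sketch. This is not the Horus algorithm itself, which the abstract does not specify. It assumes each job arrives with a single predicted GPU utilization value in [0, 1], and it uses a simple greedy rule: co-locate a job only where the combined predicted utilization stays under a cap, and prefer the least loaded GPU. The names `Gpu`, `place`, and `cap`, and the utilization figures, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    name: str
    # Maps job name -> predicted GPU utilization in [0, 1] (assumed input).
    jobs: dict = field(default_factory=dict)

    def load(self) -> float:
        """Sum of predicted utilization of the jobs already placed here."""
        return sum(self.jobs.values())


def place(job: str, predicted_util: float, gpus: list, cap: float = 1.0):
    """Greedy placement: pick the least loaded GPU that stays under the cap.

    Returns the chosen Gpu, or None if co-locating the job anywhere would
    push the combined predicted utilization above the cap.
    """
    candidates = [g for g in gpus if g.load() + predicted_util <= cap]
    if not candidates:
        return None  # leave the job queued rather than risk interference
    best = min(candidates, key=lambda g: g.load())
    best.jobs[job] = predicted_util
    return best


# Hypothetical jobs with made-up utilization estimates.
gpus = [Gpu("gpu0"), Gpu("gpu1")]
for name, util in [("resnet", 0.6), ("lstm", 0.3), ("vgg", 0.5), ("bert", 0.9)]:
    chosen = place(name, util, gpus)
    print(name, "->", chosen.name if chosen else "queued")
```

A single utilization number is a deliberate simplification; a real scheduler would likely consider several resources (compute, memory, bandwidth) and the observed co-location behaviour of specific job pairs.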
