Abstract

We study how to support elasticity, that is, the ability to dynamically adjust the parallelism (i.e., the number of GPUs), for deep neural network (DNN) training in a GPU cluster. Elasticity can benefit multi-tenant GPU cluster management in many ways, for example, achieving various scheduling objectives (e.g., job throughput, job completion time, GPU efficiency) according to cluster load variations, utilizing transient idle resources, and supporting performance profiling, job migration, and straggler mitigation. We propose EDL, which enables elastic deep learning with a simple API and can be easily integrated with existing deep learning frameworks such as TensorFlow and PyTorch. EDL also incorporates techniques necessary to reduce the overhead of parallelism adjustments, such as stop-free scaling and a dynamic data pipeline. We demonstrate with experiments that EDL can indeed bring significant benefits to the above-listed applications in GPU cluster management.
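The abstract does not spell out EDL's API, but the core pattern it describes, a training job whose worker count can change between iterations while the data pipeline is re-partitioned on the fly, can be sketched briefly. The following is a minimal, self-contained Python illustration of that pattern under stated assumptions: the names ElasticContext, poll_scaling_event, and shard are hypothetical stand-ins (not EDL's actual interface), and the scaling event is simulated locally rather than driven by a real cluster scheduler.

```python
# Hypothetical sketch of an elastic training loop: the number of workers
# (GPUs) may change between steps, and each worker's data shard is
# recomputed accordingly. These names are illustrative, not EDL's API.
import random

class ElasticContext:
    """Tracks the current worker set and re-partitions data on change."""

    def __init__(self, num_samples: int, num_workers: int):
        self.num_samples = num_samples
        self.num_workers = num_workers

    def poll_scaling_event(self) -> bool:
        """Stand-in for a scheduler signaling a parallelism change."""
        if random.random() < 0.1:  # simulate an occasional scaling event
            self.num_workers = random.randint(1, 8)
            return True
        return False

    def shard(self, rank: int) -> range:
        """Assign worker `rank` a contiguous slice of the dataset.
        (A sketch: any remainder samples are simply dropped here.)"""
        per_worker = self.num_samples // self.num_workers
        start = rank * per_worker
        return range(start, start + per_worker)

ctx = ElasticContext(num_samples=1_000, num_workers=4)
for step in range(20):
    if ctx.poll_scaling_event():
        # In the spirit of "stop-free scaling", training would continue on
        # the surviving workers while the data pipeline is re-sharded.
        print(f"step {step}: rescaled to {ctx.num_workers} workers")
    # Each (simulated) worker processes its shard; in real distributed
    # training this is where forward/backward and gradient all-reduce go.
    shards = [ctx.shard(rank) for rank in range(ctx.num_workers)]
```

In a real deployment the scaling signal would come from the cluster scheduler rather than a random draw, and the re-sharding would have to preserve each worker's position in the current epoch, which is the kind of overhead the abstract says EDL's dynamic data pipeline is designed to reduce.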
