Optimizing makespan and resource utilization for multi-DNN training in GPU cluster

Zhongjin Li, Victor Chang, Haiyang Hu, Maozhong Fu, Jidong Ge, Francesco Piccialli

doi:10.1016/j.future.2021.06.021

Abstract

Deep neural networks (DNNs) have been widely applied in many fields of artificial intelligence (AI), gaining great popularity in both industry and academia. Increasing the size of DNN models can dramatically improve learning accuracy. However, training large-scale DNN models on a single GPU takes an unacceptably long time. To speed up the training process, many distributed deep learning (DL) systems and frameworks have been designed and published for parallel DNN training with multiple GPUs. However, most existing studies concentrate only on improving the training speed of a single DNN model under centralized or decentralized systems with synchronous or asynchronous approaches. Few works consider multi-DNN training on a GPU cluster, which is a joint optimization problem of job scheduling and resource allocation. This paper proposes an optimizing makespan and resource utilization (OMRU) approach to minimize job completion time and improve resource utilization for multi-DNN training in a GPU cluster. Specifically, we first collect the training speed/time data of all DNN models by running each job for one epoch on different numbers of GPUs. The OMRU algorithm, integrating job scheduling, resource allocation, and GPU reuse strategies, is then devised to minimize the total job completion time (also called makespan) and improve GPU cluster resource utilization. The linear scaling rule (LSR) is adopted for adjusting the learning rate when a DNN model is trained on multiple GPUs with a large minibatch size, which can guarantee model accuracy without tuning the other hyper-parameters. We implement the OMRU algorithm on PyTorch with the Ring-Allreduce communication architecture and a GPU cluster with 8 nodes, each of which has 4 NVIDIA V100 GPUs. Experimental results on image classification and action recognition show that OMRU achieves a makespan reduction of up to 30% compared to the baseline scheduling algorithms and reaches an average of 98.4% and 99.2% resource utilization on image classification and action recognition, respectively, with state-of-the-art model accuracy.
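
The abstract does not spell out the formula used for the linear scaling rule. As it is commonly stated, when the minibatch size is multiplied by k, the learning rate is multiplied by k as well. The following is a minimal sketch under that assumption; the function name, parameters, and the numbers in the example are hypothetical and are not taken from the paper.

```python
def scaled_learning_rate(base_lr, base_batch_size, per_gpu_batch_size, num_gpus):
    """Linear scaling rule (assumed form): lr grows in proportion to the global minibatch.

    base_lr            learning rate tuned for base_batch_size (hypothetical values)
    base_batch_size    minibatch size the base learning rate was tuned for
    per_gpu_batch_size minibatch size processed by each GPU per step
    num_gpus           number of GPUs allocated to the job
    """
    if min(base_batch_size, per_gpu_batch_size, num_gpus) <= 0:
        raise ValueError("batch sizes and GPU count must be positive")
    # With data-parallel training, the global minibatch is the sum over all GPUs.
    global_batch_size = per_gpu_batch_size * num_gpus
    # Scale factor k = global_batch_size / base_batch_size.
    return base_lr * global_batch_size / base_batch_size


# Example: a job tuned at lr=0.1 for a minibatch of 256 is moved to 8 GPUs,
# each still processing 256 samples per step (k = 8).
lr_8_gpus = scaled_learning_rate(0.1, 256, 256, 8)
```

Under this sketch, a scheduler that changes the number of GPUs assigned to a job would only need to recompute the learning rate from the new GPU count, leaving the other hyper-parameters untouched.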
