Abstract
To cope with the growing scale of deep neural network (DNN) models and training data, the use of cloud computing for distributed DNN training is becoming increasingly popular. The amount of available resources in a cloud fluctuates continuously with user demand. Although distributed DNN training runs for a long time, ranging from several hours to several days, existing frameworks either do not support dynamic scaling or incur high scale-in/out overhead. It is therefore difficult to achieve higher performance by adding graphics processing unit (GPU) nodes to a running training cluster, even when surplus GPU resources become available. In addition, the inability to dynamically reconfigure the training cluster prevents reshaping the cluster topology when it was created sub-optimally. This paper proposes a dynamic scaling technique that adds and removes workers without suspending the ongoing training job. We also propose a heterogeneity-aware, straggler-proof technique so that, even when the GPUs in the cloud have uneven performance, adding surplus resources still yields a guaranteed performance benefit. When scaling an existing cluster of five workers out to ten, the proposed scheme improved throughput by up to a factor of 17.52 compared to the existing checkpoint-based scheme. Furthermore, with the proposed scheme, training continued at 95.52% of the maximum performance, whereas training was stopped for 841 seconds in Elastic Horovod, which supports dynamic scaling. Finally, even when GPUs with different performance levels were mixed, the error between the determined batch size and the optimal batch size was 3.37% on average.
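As a rough illustration of the heterogeneity-aware batch sizing described above, the following minimal Python sketch splits a fixed global batch size across workers in proportion to their measured per-GPU throughput. The function name, the sample throughput values, and the proportional-split policy are assumptions for illustration only, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's implementation): split a fixed global
# batch size across workers in proportion to measured per-GPU throughput,
# so faster GPUs receive larger per-worker batches and slower GPUs smaller ones.

def split_global_batch(global_batch_size, throughputs):
    """Return per-worker batch sizes proportional to each worker's measured
    throughput (samples/sec), summing exactly to global_batch_size."""
    total = sum(throughputs)
    # Initial proportional shares, rounded down.
    shares = [int(global_batch_size * t / total) for t in throughputs]
    # Hand the rounding remainder to the fastest workers first.
    remainder = global_batch_size - sum(shares)
    fastest_first = sorted(range(len(throughputs)),
                           key=lambda i: throughputs[i], reverse=True)
    for idx in fastest_first[:remainder]:
        shares[idx] += 1
    return shares

# Hypothetical example: five fast and five slower GPUs sharing a global batch of 1024.
if __name__ == "__main__":
    measured = [900, 900, 900, 900, 900, 450, 450, 450, 450, 450]  # samples/sec
    print(split_global_batch(1024, measured))
```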