Parallelizing DNN Training on GPUs: Challenges and Opportunities

Weizheng Xu,Youtao Zhang,Xulong Tang

doi:10.1145/3442442.3452055

Abstract

In recent years, Deep Neural Networks (DNNs) have emerged as a widely adopted approach in many application domains. Training DNN models is also becoming a significant fraction of the datacenter workload. Recent evidence has demonstrated that modern DNNs are becoming more complex and the size of DNN parameters (i.e., weights) is also increasing. In addition, a large amount of input data is required to train the DNN models to reach target accuracy. As a result, the training performance becomes one of the major challenges that limit DNN adoption in real-world applications. Recent works have explored different parallelism strategies (i.e., data parallelism and model parallelism) and used multi-GPUs in datacenters to accelerate the training process. However, naively adopting data parallelism and model parallelism across multiple GPUs can lead to sub-optimal executions. The major reasons are i) the large amount of data movement that prevents the system from feeding the GPUs with the required data in a timely manner (for data parallelism); and ii) low GPU utilization caused by data dependency between layers that placed on different devices (for model parallelism). In this paper, we identify the main challenges in adopting data parallelism and model parallelism on multi-GPU platforms. Then, we conduct a survey including recent research works targeting these challenges. We also provide an overview of our work-in-progress project on optimizing DNN training on GPUs. Our results demonstrate that simple-yet-effective system optimizations can further improve the training scalability compared to prior works.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

R Discovery Prime

R Discovery Prime

Parallelizing DNN Training on GPUs: Challenges and Opportunities

Abstract

Talk to us

Similar Papers

Lead the way for us

Similar Papers

An In-Depth Analysis of Distributed Training of Deep Neural Networks
Yunyong Ko ... Sang-Wook Kim
-
Yunyong Ko, et. al.Yunyong Ko ... Sang-Wook Kim
01 May 2021
01 May 2021

Neuroevolution in Deep Neural Networks: Current Trends and Future Challenges
Edgar Galvan ... Peter Mooney
IEEE Transactions on Artificial Intelligence | VOL. 2
Edgar Galvan, et. al.Edgar Galvan ... Peter Mooney
04 May 2021
IEEE Transactions on Artificial Intelligence | VOL. 2

A convergence analysis of Nesterov’s accelerated gradient method in training deep linear neural networks
Xin Liu ... Zhisong Pan
Information Sciences | VOL. 612
Xin Liu, et. al.Xin Liu ... Zhisong Pan
05 Sep 2022
Information Sciences | VOL. 612

A Framework for Distributed Deep Neural Network Training with Heterogeneous Computing Platforms
Bontak Gu ... Arslan Munir
-
Bontak Gu, et. al.Bontak Gu ... Arslan Munir
01 Dec 2019
01 Dec 2019

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

Parallelizing DNN Training on GPUs: Challenges and Opportunities

Abstract

Talk to us

Similar Papers