Abstract

High-performance multi-GPU systems are widely used to accelerate the training of deep neural networks (DNNs) by exploiting the inherently massive parallelism of the training process. Typically, DNN training on multi-GPU systems follows a data-parallel model in which the DNN is replicated on every GPU, and each GPU performs Forward Propagation (FP), Backward Propagation (BP), and Weight Update (WU). We analyze the WU stage, which consists of collective communication operations (e.g., allReduce, broadcast) that demand very efficient communication among the GPUs to avoid diminishing returns as the number of GPUs in the system grows. To address this issue, NVIDIA has introduced different data transfer mechanisms and libraries, which have been adopted by high-level frameworks for training DNNs. In this work, we evaluate and compare the performance of the peer-to-peer (P2P) data transfer method and the NCCL library-based communication method for training DNNs on a DGX-1 system consisting of 8 NVIDIA Volta-based GPUs. We profile and analyze the training of five popular DNNs (GoogLeNet, AlexNet, Inception-v3, ResNet, and LeNet) using 1, 2, 4, and 8 GPUs. We break down the training time across the FP+BP stage and the WU stage to provide insights into the limiting factors of the training algorithm and to identify bottlenecks in the multi-GPU system architecture. Our detailed profiling and analysis can help programmers and DNN model designers accelerate the training of DNNs.

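As a concrete illustration of the collective communication that dominates the WU stage, the following minimal sketch performs an in-place allReduce (sum) of per-GPU gradient buffers with NCCL from a single process. It is not the code used in our experiments: the device count, buffer size, and initialization are placeholder values chosen only to show the structure of an NCCL-based gradient reduction.

/* Minimal sketch: single-process NCCL allReduce of a gradient buffer
 * across several GPUs, as performed during the WU stage of
 * data-parallel training. Sizes and device count are illustrative. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
    const int nDev = 4;                 /* assumed number of GPUs */
    const size_t count = 1 << 20;       /* assumed gradient length (floats) */
    int devs[4] = {0, 1, 2, 3};

    ncclComm_t comms[4];
    float *grads[4];
    cudaStream_t streams[4];

    /* Allocate one gradient buffer and one stream per GPU. */
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc((void **)&grads[i], count * sizeof(float));
        cudaMemset(grads[i], 0, count * sizeof(float)); /* placeholder contents */
        cudaStreamCreate(&streams[i]);
    }

    /* One NCCL communicator per GPU, all within this process. */
    ncclCommInitAll(comms, nDev, devs);

    /* Sum the gradients in place across all GPUs (the core WU collective). */
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(grads[i], grads[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    /* Wait for the reduction to complete on every GPU. */
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
    }

    /* Clean up. */
    for (int i = 0; i < nDev; ++i) {
        cudaFree(grads[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("allReduce complete on %d GPUs\n", nDev);
    return 0;
}

Whether NCCL routes this collective over NVLink, PCIe P2P, or staged host copies depends on the system topology, which is precisely the behavior the P2P-versus-NCCL comparison in this work examines.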