Abstract

Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications such as image recognition and natural language processing. While one advantage of CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations account for most of the execution time, multiple algorithms have been, and continue to be, developed to accelerate these operations. However, due to the wide range of convolution parameter configurations used in CNNs and the possible data type representations, it is not straightforward to assess in advance which of the available algorithms will perform best in each particular case. In this paper, we present a performance evaluation of the convolution algorithms provided by cuDNN, the library used by most deep learning frameworks for their GPU operations. In our analysis, we leverage the convolution parameter configurations of widely used CNNs and discuss which algorithms are better suited depending on the convolution parameters, for both 32-bit and 16-bit floating-point (FP) data representations. Our results show that the filter size and the number of inputs are the most significant parameters when selecting a GPU convolution algorithm for 32-bit FP data. For 16-bit FP, leveraging specialized arithmetic units (NVIDIA Tensor Cores) is key to obtaining the best performance.
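As an illustration of how the algorithm-selection problem described above is exposed to applications, the sketch below (not taken from the paper; the layer dimensions and names are hypothetical examples) uses the cuDNN v7 API to time every available forward-convolution algorithm for one layer configuration. For FP16 data, requesting CUDNN_TENSOR_OP_MATH on the convolution descriptor is what allows cuDNN to dispatch the work to the Tensor Cores.

```c
// Minimal sketch (illustrative only): benchmark the cuDNN forward-convolution
// algorithms for one hypothetical layer configuration.
#include <cudnn.h>
#include <stdio.h>

#define CHECK_CUDNN(call)                                                  \
    do {                                                                   \
        cudnnStatus_t s = (call);                                          \
        if (s != CUDNN_STATUS_SUCCESS) {                                   \
            fprintf(stderr, "cuDNN error: %s\n", cudnnGetErrorString(s));  \
            return 1;                                                      \
        }                                                                  \
    } while (0)

int main(void) {
    cudnnHandle_t handle;
    CHECK_CUDNN(cudnnCreate(&handle));

    /* Illustrative layer: batch 32, 64 input channels, 56x56 images,
       128 filters of size 3x3, padding 1, stride 1, FP32 data. */
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    CHECK_CUDNN(cudnnCreateTensorDescriptor(&xDesc));
    CHECK_CUDNN(cudnnCreateTensorDescriptor(&yDesc));
    CHECK_CUDNN(cudnnCreateFilterDescriptor(&wDesc));
    CHECK_CUDNN(cudnnCreateConvolutionDescriptor(&convDesc));

    CHECK_CUDNN(cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW,
                                           CUDNN_DATA_FLOAT, 32, 64, 56, 56));
    CHECK_CUDNN(cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT,
                                           CUDNN_TENSOR_NCHW, 128, 64, 3, 3));
    CHECK_CUDNN(cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                                CUDNN_CROSS_CORRELATION,
                                                CUDNN_DATA_FLOAT));
    /* With FP16 tensors one would pass CUDNN_TENSOR_OP_MATH here so that
       cuDNN may use Tensor Cores; this FP32 sketch keeps the default. */
    CHECK_CUDNN(cudnnSetConvolutionMathType(convDesc, CUDNN_DEFAULT_MATH));

    /* Derive the output tensor shape from the input and filter descriptors. */
    int n, c, h, w;
    CHECK_CUDNN(cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc,
                                                      &n, &c, &h, &w));
    CHECK_CUDNN(cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW,
                                           CUDNN_DATA_FLOAT, n, c, h, w));

    /* Time every available forward algorithm and report the measurements. */
    cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
    int returned = 0;
    CHECK_CUDNN(cudnnFindConvolutionForwardAlgorithm(
        handle, xDesc, wDesc, convDesc, yDesc,
        CUDNN_CONVOLUTION_FWD_ALGO_COUNT, &returned, perf));

    for (int i = 0; i < returned; ++i)
        printf("algo %d: status %d, %.3f ms, %zu bytes workspace\n",
               (int)perf[i].algo, (int)perf[i].status,
               perf[i].time, perf[i].memory);

    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
    return 0;
}
```

In practice, frameworks repeat a query of this kind for every distinct convolution configuration in a network, which is why understanding how the best algorithm varies with the layer parameters matters.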

Highlights

  • Deep neural networks (DNNs) have received considerable attention in recent years due to their outstanding results in applications such as image classification and segmentation, natural language understanding, or speech recognition [14], [17], [20]

  • We evaluate vendor-provided implementations of these algorithms on the most recent high-end computing and deep learning platform based on graphics processing unit (GPU) technology. The CUDA Deep Neural Network library (cuDNN), provided by NVIDIA as a fine-tuned library for its GPUs, is used by most deep learning frameworks in production, such as TensorFlow [1], PyTorch [26], or Caffe2 [8]

  • The results we present in this paper were obtained using the latest NVIDIA GPU for high-performance computing (Tesla V100), which is well suited for computing convolutions, together with the most recent versions of CUDA and cuDNN (9.1 and 7.1, respectively)


Summary

INTRODUCTION

Deep neural networks (DNNs) have received considerable attention in recent years due to their outstanding results in applications such as image classification and segmentation, natural language understanding, or speech recognition [14], [17], [20]. Convolutional layers use a much smaller number of weights than fully-connected layers, and these weights are shared by all output computations. In terms of benefits compared to fully-connected layers, convolutional layers feature reduced storage and computational costs, and these costs no longer depend on the input and output sizes. Instead, these layers are defined by a set of hyperparameters (such as the size of the 2D tiles of weights), which are chosen by the designer of the CNN. To the best of our knowledge, this article presents the first in-depth performance analysis of all available implementations of convolution algorithms on the latest NVIDIA platform. The rest of this manuscript is organized as follows.
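To make the storage argument concrete, the following worked comparison (illustrative numbers, not taken from the paper) contrasts the weight counts of a fully-connected layer and a convolutional layer mapping an input of size C_in x H x W to an output of size C_out x H' x W':

```latex
% Weight counts, bias terms omitted.
% Fully-connected layer: one weight per (input value, output value) pair.
W_{\mathrm{fc}} = (C_{\mathrm{in}} H W)\,(C_{\mathrm{out}} H' W')
% Convolutional layer: C_out filters of size K x K x C_in,
% shared across all output positions, independent of H, W, H', W'.
W_{\mathrm{conv}} = C_{\mathrm{out}}\, C_{\mathrm{in}}\, K^{2}
% Example: C_in = C_out = 64, H = W = H' = W' = 56, K = 3:
W_{\mathrm{fc}} = (64 \cdot 56 \cdot 56)^{2} \approx 4.0 \times 10^{10}, \qquad
W_{\mathrm{conv}} = 64 \cdot 64 \cdot 9 = 36{,}864
```

The convolutional count depends only on the filter size and the channel counts, which is why the hyperparameters chosen by the CNN designer, rather than the image resolution, determine the layer's storage cost.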

BACKGROUND
GUIDELINES
FINDINGS
CONCLUSIONS