Abstract

Convolution is one of the most time-consuming operations in convolutional neural networks (CNNs). State-of-the-art CNNs use small, 3 × 3 filters. Recent work on Winograd convolution can significantly reduce the computational complexity, making convolution much faster. However, existing implementations of Winograd convolution are limited to small tiles, i.e., F(4 × 4, 3 × 3) and F(2 × 2, 3 × 3), where 4 × 4 and 2 × 2 are the output tile sizes and 3 × 3 is the filter size, and to single-precision data. In this paper, we propose an optimized mixed-precision F(6 × 6, 3 × 3) Winograd convolution implementation on NVIDIA Ampere GPUs using Tensor Cores. Our experiments show that the accuracy of mixed-precision F(6 × 6, 3 × 3) Winograd convolution is sufficient for CNN inference. Moreover, our method achieves up to 15.71x and 2.41x speedup on an NVIDIA Ampere A100, compared with the state-of-the-art Winograd-based convolution and GEMM-based convolution in cuDNN 8.1.0, respectively. We also integrate our F(6 × 6, 3 × 3) Winograd convolution implementation into NVIDIA TensorRT, NVIDIA's C++ inference library for GPUs, as custom layer plugins, and build the whole VGG network from our custom Winograd convolution layers and the other layers supported by TensorRT. The experiments show that the accuracy of the whole VGG network using our F(6 × 6, 3 × 3) Winograd convolution is 71.24%, while the accuracy of the same network computed in FP32 is 71.22%.
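For readers unfamiliar with the F(m × m, r × r) notation: it denotes Winograd's minimal filtering algorithm, which produces an m × m output tile from an r × r filter via the transform Y = Aᵀ[(G g Gᵀ) ⊙ (Bᵀ d B)]A, using (m + r − 1)² element-wise multiplications instead of the m²r² a direct computation needs. Below is a minimal NumPy sketch of the smallest variant, F(2 × 2, 3 × 3), using the standard transform matrices from Lavin and Gray (2016); it illustrates the underlying algorithm only and is not the paper's Tensor Core implementation.

```python
# Minimal sketch of the F(2x2, 3x3) Winograd minimal filtering algorithm
# (transform matrices from Lavin & Gray, 2016). Illustrative only; the
# paper's mixed-precision Tensor Core kernels are far more involved.
import numpy as np

# Input transform B^T, filter transform G, output transform A^T.
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def winograd_2x2_3x3(d, g):
    """Compute a 2x2 output tile of the correlation of a 4x4 input tile d
    with a 3x3 filter g, using 16 multiplications instead of 36."""
    U = G @ g @ G.T       # 4x4 transformed filter
    V = Bt @ d @ Bt.T     # 4x4 transformed input tile
    M = U * V             # element-wise product: the 16 multiplications
    return At @ M @ At.T  # inverse transform to the 2x2 output tile

# Check against direct sliding-window correlation.
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4)).astype(np.float32)
g = rng.standard_normal((3, 3)).astype(np.float32)
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_2x2_3x3(d, g), direct, atol=1e-4)
```

F(6 × 6, 3 × 3) has the same structure with 8 × 8 transforms: 64 multiplications yield 36 outputs, cutting the per-output multiplication count from 9 to about 1.78. The catch is that the larger transform matrices contain larger constants, which amplifies rounding error at reduced precision; managing that trade-off is the motivation for the mixed-precision design described in the abstract.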

