Optimizing Winograd Convolution on GPUs via Partial Kernel Fusion

Gan Tong, Libo Huang, Wentao Ma, Jing Zhang, Run Yan, Ling Yang, Sheng Ma, Mengqiao Lan, Yashuai Lü, Yuanhu Cheng

doi:10.1007/978-3-031-21395-3_2

Abstract

Convolution operations are the essential components of modern CNNs (Convolutional Neural Networks) and are also the most time-consuming. Several fast convolution algorithms, including FFT and Winograd, have been proposed to solve this problem. Winograd convolution is used to improve the inference performance of convolution operators with small kernels, which are the mainstream in current popular CNNs. However, the implementations of Winograd convolution in many highly optimized deep neural network libraries and deep learning compilers are not efficient. The complex data dependencies among the four stages of Winograd convolution make it very challenging to optimize. In this paper, we improve the inference performance of the Winograd convolution operator on GPUs. We propose a sync-free implementation of the calculation stage of Winograd and further propose PKF (Partial Kernel Fusion) methods that utilize different memory levels of GPUs. We implemented PKF-Reconstructor, based on TVM, for PKF Winograd convolution. Evaluations on convolution operators from real-world CNNs show that our method achieves a speedup of \(8.22\times\)–\(13.69\times\) compared to cuDNN and \(4.89\times\)–\(9.10\times\) compared to the fastest vanilla TVM Winograd implementation.

Keywords: Winograd convolution, Convolution optimizing, Sync-free BGEMM, Partial kernel fusion
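As background for the four stages mentioned in the abstract, the following is a minimal sketch of the textbook one-dimensional Winograd algorithm F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. It is not the PKF implementation described in the paper; the function names (`filter_transform`, `input_transform`, `output_transform`, `winograd_f23`, `direct_f23`) are hypothetical and chosen only for illustration, and the transform constants are the standard ones for F(2,3).

```python
# Illustrative sketch only: textbook 1D Winograd F(2,3), not the paper's method.

def filter_transform(g):
    """Stage 1: transform the 3-tap filter g into 4 values (U = G g)."""
    g0, g1, g2 = g
    return [g0, 0.5 * (g0 + g1 + g2), 0.5 * (g0 - g1 + g2), g2]


def input_transform(d):
    """Stage 2: transform the 4-element input tile d (V = B^T d)."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]


def output_transform(m):
    """Stage 4: map the 4 products back to 2 outputs (y = A^T m)."""
    m0, m1, m2, m3 = m
    return [m0 + m1 + m2, m1 - m2 - m3]


def winograd_f23(d, g):
    """Compute 2 outputs of a 3-tap filter over a 4-element tile."""
    u = filter_transform(g)
    v = input_transform(d)
    # Stage 3: calculation stage, here 4 elementwise multiplications.
    m = [ui * vi for ui, vi in zip(u, v)]
    return output_transform(m)


def direct_f23(d, g):
    """Reference: direct sliding-window computation (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


print(winograd_f23([1, 2, 3, 4], [1, 2, 3]))  # [14.0, 20.0]
print(direct_f23([1, 2, 3, 4], [1, 2, 3]))    # [14, 20]
```

In the two-dimensional, multi-channel case the elementwise products of stage 3 are additionally summed over input channels, so that stage is commonly expressed as a batched matrix multiplication (BGEMM); this is the calculation stage the abstract refers to.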

Full Text