Abstract

Modern convolutional neural networks (CNNs) require a massive number of convolution operations. To address this overwhelming computation, the Winograd and FFT fast algorithms have been used as effective approaches to reduce the number of multiplications. Inputs and filters are first transformed into a special domain and then multiplied element-wise, a step that can be organized as a batched GEMM operation. The different stages of the computation contain multiple tasks with different computation and memory behaviors, and these tasks share intermediate data, which provides the opportunity to fuse them into a monolithic kernel. Traditional kernel fusion, however, suffers from insufficient shared memory, which limits performance. In this article, we propose a new kernel fusion technique for fast convolution algorithms based on MegaKernel. GPU thread blocks are assigned different computation tasks, and we design a mapping algorithm that assigns tasks to thread blocks. We build a scheduler that fetches and executes the tasks following their dependency relationships. Evaluation on modern CNNs shows that our technique achieves average speedups of 1.25x and 1.7x over cuDNN's two implementations of the Winograd convolution algorithm.
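The pipeline described above (transform inputs and filters into a special domain, multiply element-wise, transform back) can be illustrated with the classic one-dimensional Winograd F(2,3) algorithm. This is a minimal sketch, not the paper's implementation: the transform matrices follow the standard minimal-filtering construction, and all function names are illustrative.

```python
import numpy as np

# 1-D Winograd F(2,3): computes 2 outputs of a 3-tap filter from a
# 4-element input tile using 4 multiplications instead of 6.
# Standard minimal-filtering transform matrices (illustrative sketch).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                     # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)    # output transform

def winograd_f23(d, g):
    """Two outputs of correlating the 3-tap filter g with the 4-element tile d."""
    U = G @ g      # filter in the transform domain
    V = BT @ d     # input in the transform domain
    M = U * V      # element-wise multiplication (the batched-GEMM stage)
    return AT @ M  # inverse transform back to the spatial domain

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 1.0, 1.0])
direct = np.array([d[i:i + 3] @ g for i in range(2)])  # direct correlation
assert np.allclose(winograd_f23(d, g), direct)
```

In a full 2-D convolution layer, the element-wise products across all tiles, channels, and filters at one position of the transform domain collapse into a matrix multiply, which is why the stage can be expressed as a batched GEMM.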

Highlights

  • We propose a novel kernel fusion technique, based on megakernels, for the Winograd convolution algorithm on GPUs

Introduction

Convolutional neural networks (CNNs) are the state-of-the-art solution for image classification, detection, and many other computer vision tasks [1], [2], [3]. The convolutional layers occupy more than 90 percent of the total computation in many popular neural networks [4]. Hardware accelerators such as GPUs, FPGAs, and ASICs have been employed to deal with this overwhelming computation pressure [5], [6], [7]. For a convolution operation with C input channels, K filters of size R × S, and a batch of N images, the output tensor O is given by O[n, k, x, y] = Σ_c Σ_r Σ_s I[n, c, x + r, y + s] · F[k, c, r, s], where I is the input tensor and F is the filter tensor.
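As a sanity check on the formula, direct convolution can be sketched in a few lines of NumPy; the tensor shapes and names below are illustrative (no padding, stride 1), not the paper's code.

```python
import numpy as np

def direct_conv(I, F):
    """Direct convolution: O[n,k,x,y] = sum over c,r,s of I[n,c,x+r,y+s] * F[k,c,r,s].

    I: input batch, shape (N, C, H, W); F: filters, shape (K, C, R, S).
    Returns O with shape (N, K, H-R+1, W-S+1) (no padding, stride 1).
    """
    N, C, H, W = I.shape
    K, _, R, S = F.shape
    O = np.zeros((N, K, H - R + 1, W - S + 1))
    for n in range(N):
        for k in range(K):
            for x in range(H - R + 1):
                for y in range(W - S + 1):
                    # Sum over input channels c and filter offsets r, s.
                    O[n, k, x, y] = np.sum(I[n, :, x:x + R, y:y + S] * F[k])
    return O

rng = np.random.default_rng(0)
I = rng.standard_normal((2, 3, 6, 6))  # N=2 images, C=3 channels, 6x6 each
F = rng.standard_normal((4, 3, 3, 3))  # K=4 filters of size 3x3
O = direct_conv(I, F)
assert O.shape == (2, 4, 4, 4)
```

Every output element is an independent reduction over C·R·S products, which is what fast algorithms such as Winograd and FFT restructure to reduce the multiplication count.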
