OpenCNN: A Winograd Minimal Filtering Algorithm Implementation in CUDA

Roberto L Castro, Basilio B Fraguela, Diego Andrade

doi:10.3390/math9172033


Open Access

https://doi.org/10.3390/math9172033


Journal: Mathematics	Publication Date: Aug 24, 2021
Citations: 1	License type: CC BY 4.0

Affiliation: University of A Coruña

Abstract

Improving the performance of the convolution operation has become a key target for High Performance Computing (HPC) developers due to its prevalence in deep learning, applied mainly to video processing. The improvement is being driven by both algorithmic and implementation innovations. Algorithmically, the convolution can be computed directly as it is mathematically defined, but other methods transform it into a Fast Fourier Transform (FFT) or a GEneral Matrix Multiplication (GEMM). Within the latter group, the Winograd algorithm is a state-of-the-art variant that is especially suitable for smaller convolutions. In this paper, we present OpenCNN, an optimized CUDA C++ implementation of the Winograd convolution algorithm. Our approach achieves speedups of up to 1.76× on Turing RTX 2080Ti and up to 1.85× on Ampere RTX 3090 with respect to Winograd convolution in cuDNN 8.2.0. OpenCNN is released as open-source software.
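
As a rough illustration of why Winograd minimal filtering suits small convolutions, the sketch below compares the direct computation of two outputs of a 3-tap 1D filter with the textbook F(2,3) Winograd form, which needs 4 element-wise multiplications instead of 6. It is plain host-side C++ rather than a CUDA kernel, the function names `direct_f23` and `winograd_f23` are hypothetical, and it is not taken from the OpenCNN code base, whose tile sizes and kernel structure are not described in this abstract.

```cpp
#include <array>

// Direct evaluation: 2 outputs from 4 inputs and a 3-tap filter.
// Uses 6 multiplications.
std::array<float, 2> direct_f23(const std::array<float, 4>& d,
                                const std::array<float, 3>& g) {
    return {d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]};
}

// Winograd F(2,3): the same 2 outputs with 4 multiplications
// between the transformed input and the transformed filter.
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
    // Filter transform (can be precomputed once per filter).
    const float u0 = g[0];
    const float u1 = (g[0] + g[1] + g[2]) * 0.5f;
    const float u2 = (g[0] - g[1] + g[2]) * 0.5f;
    const float u3 = g[2];

    // Input transform (additions and subtractions only).
    const float v0 = d[0] - d[2];
    const float v1 = d[1] + d[2];
    const float v2 = d[2] - d[1];
    const float v3 = d[1] - d[3];

    // Element-wise products: the only 4 multiplications on the data path.
    const float m1 = v0 * u0;
    const float m2 = v1 * u1;
    const float m3 = v2 * u2;
    const float m4 = v3 * u3;

    // Output transform.
    return {m1 + m2 + m3, m2 - m3 - m4};
}
```

For the input (1, 2, 3, 4) and the filter (1, 2, 3), both functions return (14, 20). In a convolutional layer the filter transform is typically amortized over many input tiles, so the saving in multiplications applies to the per-tile work.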

Highlights

The use of GPUs in machine learning is generating a tremendous innovation boost, especially in areas like computer vision [1]
The CUDA 11.2 Toolkit has been used on two platforms: NVIDIA Turing RTX 2080Ti and NVIDIA Ampere RTX 3090 GPUs
ResNet is a powerful Convolutional Neural Network (CNN) model used in many computer vision (CV) problems [13]

Summary

Introduction

The use of GPUs in machine learning is generating a tremendous innovation boost, especially in areas like computer vision [1]. Convolutional Neural Networks (CNNs) have achieved state-of-the-art accuracy in many areas related to computer vision, being able to pinpoint the expected result and to surpass human precision in many cases. As their name suggests, the convolution operation is the core of this type of network. As CNN architectures continue to grow in depth and width, the computational cost of their training and inference increases [5]. For this reason, optimizing the performance of the convolution operation is the target of recent research.
