Abstract

Convolutional Neural Networks (CNNs) are among the most widely used deep learning architectures. The execution time of CNNs is dominated by the convolution steps. Most CNN implementations lower the convolution into a matrix-based operation through the im2col (image to column) process; the transformed convolution can then be easily parallelized with highly efficient BLAS libraries. The contribution of this paper is the observation of significant but intricately patterned data redundancy in this matrix representation of convolution, a redundancy that has not been exploited before to improve the performance of CNNs. In this paper, we analyze the origin of the redundancy generated by the im2col process and reveal a new data pattern that describes the matrix representation of convolution more concisely. Based on this redundancy-minimized matrix representation, we implement an FFT-based convolution with finer FFT granularity. It achieves a 23% average and 50% maximum speedup over regular FFT convolution, and a 93% average and 286% maximum speedup over the Im2col+GEMM method from NVIDIA's cuDNN library, one of the most widely used CNN libraries. Moreover, by replacing the existing methods with our new convolution method in Caffe, a popular deep learning framework, we observe a 74% average speedup for multiple synthetic CNNs in closer-to-real-world application scenarios and a 25% speedup for a variant of the VGG network.
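To make the im2col lowering concrete, the following is a minimal illustrative sketch (not the paper's implementation): each sliding window of the input is copied into one column of a matrix, so the convolution reduces to a single matrix product, and the copying of overlapping windows is exactly the source of the data redundancy the abstract describes. The function names `im2col` and `conv2d_via_gemm` are hypothetical, chosen for illustration.

```python
# Illustrative sketch of im2col + GEMM convolution (single channel,
# stride 1, no padding). Not the paper's code.
import numpy as np

def im2col(x, kh, kw):
    """Unfold each kh x kw sliding window of the 2D input x into a column.

    Overlapping windows copy the same input elements many times, which
    is the data redundancy analyzed in the paper.
    """
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_via_gemm(x, k):
    """Convolution (as cross-correlation) lowered to a matrix product."""
    kh, kw = k.shape
    cols = im2col(x, kh, kw)
    out = k.ravel() @ cols  # a single GEMM call in real BLAS libraries
    return out.reshape(x.shape[0] - kh + 1, x.shape[1] - kw + 1)

x = np.arange(16.0).reshape(4, 4)
k = np.ones((3, 3))
print(conv2d_via_gemm(x, k))  # 2x2 output; each entry sums one 3x3 window
```

Note that for a 4x4 input and 3x3 kernel, the 16 input values are expanded into a 9x4 matrix of 36 values, which shows how quickly the lowered representation grows relative to the original input.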
