Abstract

Convolutional Neural Networks (CNNs) have been widely adopted in many artificial intelligence applications, and most of their computational overhead is spent on convolutions. An effective approach to reducing this overhead is to transform convolutions in the time domain into multiplications in the frequency domain by means of Fast Fourier Transform (FFT) algorithms, known as FFT-based fast algorithms for convolutions. However, current FFT-based fast implementations only work for unit-strided convolutions (stride 1) and cannot be directly applied to strided convolutions (stride greater than 1), which are commonly used as the first layer of CNNs and as an effective alternative to pooling layers for downsampling. In this paper, we first introduce rearrangement- and sampling-based methods for applying FFT-based fast algorithms to strided convolutions, and compare the arithmetic complexities of these two methods and the direct method in detail. We then present highly optimized parallel implementations of the two methods on an ARMv8-based many-core CPU. Lastly, we benchmark these implementations against two GEMM-based implementations on the same CPU. Our experimental results with convolutions of different kernels, feature maps, and batch sizes show that the rearrangement-based method generally outperforms the sampling-based one under the same optimizations, and that both methods achieve much better performance than the GEMM-based ones when the kernels, feature maps, and batch sizes are large. Experimental results on the convolutional layers of popular CNNs further support these conclusions.
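To make the two ideas concrete, the sketch below illustrates them in 1D with NumPy; it is only a schematic of the underlying mathematics, not the authors' optimized ARMv8 implementation, and all function names are hypothetical. The sampling-based method computes the unit-stride FFT convolution and keeps every s-th output; the rearrangement-based method splits the input and kernel into s polyphase components and sums s smaller unit-stride FFT convolutions.

```python
import numpy as np

def fft_corr_valid(x, w):
    """Unit-stride 'valid' cross-correlation via the convolution theorem:
    pointwise multiplication in the frequency domain."""
    n = len(x) + len(w) - 1
    full = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(w[::-1], n), n)
    return full[len(w) - 1 : len(x)]  # keep only the 'valid' outputs

def strided_corr_direct(x, w, s):
    """Reference: direct strided cross-correlation."""
    return np.array([x[i : i + len(w)] @ w
                     for i in range(0, len(x) - len(w) + 1, s)])

def strided_corr_sampling(x, w, s):
    """Sampling-based method: run the unit-stride FFT convolution,
    then sample every s-th output."""
    return fft_corr_valid(x, w)[::s]

def strided_corr_rearrange(x, w, s):
    """Rearrangement-based method: decompose input and kernel into s
    polyphase components and sum s smaller unit-stride convolutions."""
    n_out = (len(x) - len(w)) // s + 1
    y = np.zeros(n_out)
    for j in range(s):
        xj, wj = x[j::s], w[j::s]
        if len(wj):  # a component may be empty when s > len(w)
            y += fft_corr_valid(xj, wj)[:n_out]
    return y

# All three agree on a random example (stride 2).
x, w = np.random.rand(64), np.random.rand(5)
ref = strided_corr_direct(x, w, 2)
assert np.allclose(strided_corr_sampling(x, w, 2), ref)
assert np.allclose(strided_corr_rearrange(x, w, 2), ref)
```

The sketch also hints at the complexity trade-off the paper quantifies: the sampling-based method computes (and then discards) all unit-stride outputs, whereas the rearrangement-based method works only on the s downsampled components, each of which admits smaller FFTs.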
