Abstract

Convolutional neural networks (CNNs) have been widely adopted for computer vision applications. CNNs require many multiplications, making them expensive in terms of both computational complexity and hardware. An effective way to reduce the number of required multiplications is the Winograd algorithm. Previous Winograd-based CNN implementations use the 2-D algorithm $F(2 \times 2, 3 \times 3)$, which reduces computational complexity by a factor of 2.25 over regular convolution. However, existing Winograd implementations only apply when the stride (the shift displacement of a kernel over an input) is 1. In this article, we present a novel method to apply the Winograd algorithm to a stride of 2. The method is valid in one, two, and three dimensions. We also introduce new Winograd versions compatible with kernels of size 3, 5, and 7. The algorithms were successfully implemented on an NVIDIA K20c GPU. Compared to regular convolution, the stride-2 implementations are $1.44\times$ faster for a $3 \times 3$ kernel, $2.04\times$ faster for a $5 \times 5$ kernel, $2.42\times$ faster for a $7 \times 7$ kernel, and $1.73\times$ faster for a $3 \times 3 \times 3$ kernel. Additionally, a CNN accelerator using a novel processing element (PE) that performs either two 2-D Winograd stride-1 operations or one 2-D Winograd stride-2 operation per clock cycle was implemented on an Intel Arria-10 field-programmable gate array (FPGA). We accelerated the original and our proposed modified VGG-16 architectures, achieving digital signal processor (DSP) efficiencies of 1.22 giga operations per second (GOPS) per DSP and 1.33 GOPS per DSP, respectively.
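To make the multiplication savings cited above concrete, the sketch below implements the standard 1-D Winograd algorithm $F(2,3)$, which computes two outputs of a 3-tap stride-1 convolution with 4 multiplications instead of 6; nesting the transform in both dimensions yields $F(2 \times 2, 3 \times 3)$ with 16 multiplications instead of 36, i.e., the 2.25 factor. This is a minimal illustration of the classical stride-1 case, not the paper's stride-2 derivation, and the function names are chosen for illustration only.

```python
# Standard 1-D Winograd F(2,3): two outputs of a 3-tap convolution
# using 4 multiplications instead of the 6 needed by direct convolution.
# Nesting this in both dimensions gives F(2x2, 3x3): 16 vs. 36 multiplications (2.25x).

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 stride-1 convolution outputs."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv(d, g):
    """Reference stride-1 convolution producing the same 2 outputs (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [0.5, -1.0, 2.0]
    print(winograd_f23(d, g))  # [4.5, 6.0]
    print(direct_conv(d, g))   # [4.5, 6.0]
```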
