Abstract

Matrix multiplication on coarse-grained reconfigurable arrays (CGRAs) is normally performed by aligning input data vertically and pipelining fused multiply-add (FMA) operations to achieve high efficiency. The final stage of a deep convolutional neural network (DCNN) is equivalent to a matrix multiplication in which the number of input channels (ICs) is large and the number of output channels (OCs) is small. However, the pipeline height of a CGRA is limited, so the ICs must be divided, which increases the number of CGRA reboots and thus the execution time. The number of reboots can be reduced by laying the ICs out horizontally and repeating FMA operations on the same unit, but SIMD is difficult to exploit on ALUs with two 32-bit floating-point FMA units, and throughput drops to 4 FMAs per 4 cycles. Recent studies have shown that high computational accuracy is unnecessary for DCNN inference, and stochastic computing has therefore attracted attention. In exchange for reduced accuracy, stochastic computing shrinks data size and circuit resources, raises the operating frequency, and also facilitates horizontal accumulation. In this study, we introduce a stochastic fused multiply-add (SFMA) unit into the CGRA, achieving a high throughput of 32 FMAs per 4 cycles. In inference for handwritten character recognition, the proposed method achieves 94% accuracy, reduces circuit area by 39%, improves frequency by 63%, cuts the number of CGRA reboots to 0.2%, and speeds up inference execution by about 46 times compared with a CGRA using floating-point FMA units.
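To make the stochastic-computing idea concrete: in unipolar stochastic computing, a value p in [0, 1] is encoded as a random bitstream whose bits are 1 with probability p, and the product of two values can be approximated by a bitwise AND of two independent streams followed by counting. The sketch below is a minimal software illustration of that principle only; the function names and stream length are our own choices, and it does not model the paper's SFMA hardware unit.

```python
import random

def to_stream(p, n, rng):
    # Encode p in [0, 1] as a unipolar stochastic bitstream of length n:
    # each bit is 1 with probability p, independently.
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(a, b, n=4096, seed=0):
    # Multiply two unipolar values by ANDing independent streams.
    # The fraction of 1s in the result approximates a * b, with an
    # error that shrinks roughly as 1/sqrt(n).
    rng = random.Random(seed)
    sa = to_stream(a, n, rng)
    sb = to_stream(b, n, rng)
    return sum(x & y for x, y in zip(sa, sb)) / n
```

Accumulation is similarly cheap in this representation (e.g., counting 1s across streams), which is why horizontal accumulation of many products maps well onto compact stochastic logic, at the cost of precision that depends on stream length.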
