Abstract

Convolutional neural networks (CNNs) are essential models for achieving high accuracy in various machine learning applications, such as image recognition and natural language processing. One of the key issues for CNN acceleration with high energy efficiency and processing performance is efficient data reuse that exploits the inherent data locality. In this paper, we propose a novel CGRA (Coarse Grained Reconfigurable Array) architecture with time-domain multithreading for exploiting input data locality. The multithreading on each processing element enables input data to be reused across multiple computation periods. This paper presents the accelerator design and performance analysis of the proposed architecture. We examine the structure of the memory subsystems, as well as the architecture of the computing array, to supply the required data with minimal performance overhead, and we explore efficient architecture design alternatives based on the characteristics of modern CNN configurations. The evaluation results show that the available bandwidth of the external memory can be utilized efficiently when the output plane is wide (in the earlier layers of many CNNs), while the input data locality can be exploited maximally when the number of output channels is large (in the later layers).
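For reference, the sketch below (not the paper's implementation; all layer dimensions and names are hypothetical placeholders) shows the direct convolution loop nest whose input data locality the proposed architecture targets: each input activation is consumed by every output channel and by several neighboring output positions, so a processing element that keeps a fetched input resident across multiple computation periods, as the time-domain multithreading does, can avoid repeated reads from external memory.

```c
#include <stdio.h>

/* Hypothetical layer dimensions, for illustration only. */
#define KO 8            /* number of output channels       */
#define KI 4            /* number of input channels        */
#define OH 6            /* output height                   */
#define OW 6            /* output width                    */
#define K  3            /* convolution kernel size (K x K) */
#define IH (OH + K - 1) /* input height (stride 1, no pad) */
#define IW (OW + K - 1) /* input width                     */

/* Direct convolution: every input element in[ki][y+ky][x+kx] is needed by
 * all KO output channels and by up to K*K neighboring output positions.
 * Keeping a fetched input resident in a processing element while it serves
 * several of these computations is the reuse referred to in the abstract. */
static void conv2d(float in[KI][IH][IW],
                   float w[KO][KI][K][K],
                   float out[KO][OH][OW])
{
    for (int ko = 0; ko < KO; ko++)
        for (int y = 0; y < OH; y++)
            for (int x = 0; x < OW; x++) {
                float acc = 0.0f;
                for (int ki = 0; ki < KI; ki++)
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++)
                            acc += w[ko][ki][ky][kx] * in[ki][y + ky][x + kx];
                out[ko][y][x] = acc;
            }
}

int main(void)
{
    static float in[KI][IH][IW], w[KO][KI][K][K], out[KO][OH][OW];
    /* Fill inputs and weights with simple values so the example runs. */
    for (int ki = 0; ki < KI; ki++)
        for (int y = 0; y < IH; y++)
            for (int x = 0; x < IW; x++)
                in[ki][y][x] = 1.0f;
    for (int ko = 0; ko < KO; ko++)
        for (int ki = 0; ki < KI; ki++)
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    w[ko][ki][ky][kx] = 0.5f;
    conv2d(in, w, out);
    printf("out[0][0][0] = %f\n", out[0][0][0]); /* KI*K*K*1.0*0.5 = 18.0 */
    return 0;
}
```

In such a loop nest, the reuse per fetched input grows with the number of output channels, which is consistent with the abstract's observation that input data locality pays off most in the later layers, while wide output planes in earlier layers favor streaming that uses the external memory bandwidth efficiently.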

Highlights

  • Convolutional neural networks (CNNs) are attracting much attention by achieving high accuracy in various applications such as image recognition, natural language processing, and object detection

  • We propose a CGRA architecture with time-domain multithreading for exploiting input data locality

  • Our evaluation shows that the proposed architecture is suitable for both the deeper and the shallower layers of general CNNs


Introduction

Convolutional neural networks (CNNs) are attracting much attention because they achieve high accuracy in various applications such as image recognition, natural language processing, and object detection. As K. Ando et al. describe, the weights of the convolution are obtained through the training of the network. This "trainable" feature extraction is the key to the high recognition accuracy of CNNs. Because CNNs are computationally intensive, many kinds of hardware acceleration, such as GPGPU computation or ASIC/FPGA-based implementations [1] [2], have been utilized to process CNNs with acceptable throughput and efficiency. FPGA-based implementations balance performance and availability with architectures optimized for the application demand, but their effective power efficiency is not very high.
