Abstract

Convolutional neural networks (CNNs) are essential models for achieving high accuracy in various machine learning applications, such as image recognition and natural language processing. One of the key issues for CNN acceleration with high energy efficiency and processing performance is efficient data reuse that exploits the inherent data locality. In this paper, we propose a novel CGRA (Coarse Grained Reconfigurable Array) architecture with time-domain multithreading for exploiting input data locality. The multithreading on each processing element enables input data to be reused across multiple computation periods. This paper presents the accelerator design and performance analysis of the proposed architecture. We examine the structure of the memory subsystems, as well as the architecture of the computing array, to supply the required data with minimal performance overhead, and we explore efficient architecture design alternatives based on the characteristics of modern CNN configurations. The evaluation results show that the available bandwidth of the external memory can be utilized efficiently when the output plane is wide (in the earlier layers of many CNNs), while the input data locality can be exploited maximally when the number of output channels is large (in the later layers).
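For reference, the sketch below (not the paper's implementation; all layer dimensions and names are hypothetical placeholders) shows the direct convolution loop nest whose input data locality the proposed architecture targets: each input activation is consumed by every output channel and by several neighboring output positions, so a processing element that keeps a fetched input resident across multiple computation periods, as the time-domain multithreading does, can avoid repeated reads from external memory.

```c
#include <stdio.h>

/* Hypothetical layer dimensions, for illustration only. */
#define KO 8            /* number of output channels       */
#define KI 4            /* number of input channels        */
#define OH 6            /* output height                   */
#define OW 6            /* output width                    */
#define K  3            /* convolution kernel size (K x K) */
#define IH (OH + K - 1) /* input height (stride 1, no pad) */
#define IW (OW + K - 1) /* input width                     */

/* Direct convolution: every input element in[ki][y+ky][x+kx] is needed by
 * all KO output channels and by up to K*K neighboring output positions.
 * Keeping a fetched input resident in a processing element while it serves
 * several of these computations is the reuse referred to in the abstract. */
static void conv2d(float in[KI][IH][IW],
                   float w[KO][KI][K][K],
                   float out[KO][OH][OW])
{
    for (int ko = 0; ko < KO; ko++)
        for (int y = 0; y < OH; y++)
            for (int x = 0; x < OW; x++) {
                float acc = 0.0f;
                for (int ki = 0; ki < KI; ki++)
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++)
                            acc += w[ko][ki][ky][kx] * in[ki][y + ky][x + kx];
                out[ko][y][x] = acc;
            }
}

int main(void)
{
    static float in[KI][IH][IW], w[KO][KI][K][K], out[KO][OH][OW];
    /* Fill inputs and weights with simple values so the example runs. */
    for (int ki = 0; ki < KI; ki++)
        for (int y = 0; y < IH; y++)
            for (int x = 0; x < IW; x++)
                in[ki][y][x] = 1.0f;
    for (int ko = 0; ko < KO; ko++)
        for (int ki = 0; ki < KI; ki++)
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    w[ko][ki][ky][kx] = 0.5f;
    conv2d(in, w, out);
    printf("out[0][0][0] = %f\n", out[0][0][0]); /* KI*K*K*1.0*0.5 = 18.0 */
    return 0;
}
```

In such a loop nest, the reuse per fetched input grows with the number of output channels, which is consistent with the abstract's observation that input data locality pays off most in the later layers, while wide output planes in earlier layers favor streaming that uses the external memory bandwidth efficiently.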

Highlights

  • Convolutional neural networks (CNNs) are attracting much attention by achieving high accuracy in various applications such as image recognition, natural language processing, and object detection

  • We propose a CGRA architecture with time-domain multithreading for exploiting input data locality

  • Our evaluation shows that the proposed architecture is suitable for both the deeper and the shallower layers of general CNNs


Introduction

Convolutional neural networks (CNNs) are attracting much attention because they achieve high accuracy in various applications such as image recognition, natural language processing, and object detection. As K. Ando et al. describe, the weights of the convolution are obtained through the training of the network. This "trainable" feature extraction is the key to the high recognition accuracy of CNNs. Because CNNs are computationally intensive, many kinds of hardware acceleration, such as GPGPU computation or ASIC/FPGA-based implementations [1] [2], have been utilized to process CNNs with acceptable throughput and efficiency. FPGA-based implementations balance performance and availability with architectures optimized for the application demand, but their effective power efficiency is not very high.
