Abstract

Due to the high throughput and high computational capability of convolutional neural networks (CNNs), researchers are paying increasing attention to the design of CNN hardware accelerator architectures. Accordingly, in this paper, we propose a block parallel computing algorithm based on the matrix transformation computing algorithm (MTCA) to realize convolution expansion and resolve the blocking problem of the intermediate matrix, enabling highly parallel implementation in hardware. Moreover, we provide a specific calculation method for the optimal partitioning of the matrix multiplication to optimize performance. In our evaluation, the proposed method saves more than 60% of hardware storage space compared with the im2col (image-to-column) approach; for large-scale convolutions, it saves nearly 82%. Under the accelerator architecture framework designed in this paper, we achieve 26.7–33.4 GFLOPS (depending on convolution type) on an FPGA (field-programmable gate array) by reducing bandwidth requirements and improving data reusability. The design is 1.2×–4.0× faster than memory-efficient convolution (MEC) and im2col, and represents an effective solution for large-scale convolution accelerators.
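To put the storage comparison above in context, the sketch below (illustrative Python; the layer dimensions are assumptions for a typical CNN layer, not figures from the paper) estimates how much the im2col intermediate matrix inflates the input: each K×K receptive field is unrolled into its own column, so overlapping patches are duplicated.

```python
# Illustrative estimate of the im2col intermediate-buffer blow-up.
# im2col unrolls every KxK receptive field of a C x H x W input into a
# column, producing a (C*K*K) x (H_out*W_out) intermediate matrix.
def im2col_buffer_elems(c, h, w, k, stride=1, pad=0):
    h_out = (h + 2 * pad - k) // stride + 1
    w_out = (w + 2 * pad - k) // stride + 1
    return (c * k * k) * (h_out * w_out)

# Assumed example layer: 64 channels, 56x56 feature map, 3x3 kernel.
c, h, w, k = 64, 56, 56, 3
direct = c * h * w                                   # elements in the input
unrolled = im2col_buffer_elems(c, h, w, k, stride=1, pad=1)
print(unrolled / direct)                             # prints 9.0: a 9x blow-up
```

For a 3×3 kernel at stride 1 with "same" padding, the im2col buffer is roughly 9× the input, which is the redundancy that the MTCA-style blocked expansion in this paper aims to avoid.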

Highlights

  • At present, convolutional neural networks (CNNs) are widely used in image classification [1], target recognition [2,3], and semantic segmentation [4,5]

  • In order to solve this problem, accelerator designs based on various platforms have been proposed, using graphics processing units (GPUs) [7], customized application-specific integrated circuits (ASICs) [8,9,10,11], field-programmable gate arrays (FPGAs), and other hardware to accelerate CNNs

  • We propose a CNN accelerator design based on a matrix transformation computing algorithm (MTCA) [15] decomposition


Summary

Introduction

CNNs are widely used in image classification [1], target recognition [2,3], and semantic segmentation [4,5]. A CNN is essentially composed of convolution layers, ReLU layers, and other layers. In a CNN model, the convolution layers account for more than 85% of the total computation [6], which imposes a huge workload. Software-only CNN implementations cannot meet current high-speed application requirements. To solve this problem, accelerator designs based on various platforms have been proposed, using graphics processing units (GPUs) [7], customized application-specific integrated circuits (ASICs) [8,9,10,11], field-programmable gate arrays (FPGAs), and other hardware to accelerate CNNs. However, due to power consumption, development cost, and development cycles, the research and development of GPU- and ASIC-based accelerators are largely limited.
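To make the conventional baseline concrete, the following minimal NumPy sketch (not the paper's MTCA; layout and dimensions are illustrative assumptions) shows how im2col lowers a convolution layer to a single matrix multiplication, which is why so much of a CNN's computation reduces to GEMM:

```python
import numpy as np

def im2col(x, k, stride=1):
    # Unroll a (C, H, W) input into a (C*k*k, H_out*W_out) matrix:
    # one column per k x k receptive field.
    c, h, w = x.shape
    h_out = (h - k) // stride + 1
    w_out = (w - k) // stride + 1
    cols = np.empty((c * k * k, h_out * w_out), dtype=x.dtype)
    idx = 0
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i * stride:i * stride + k, j * stride:j * stride + k]
            cols[:, idx] = patch.ravel()
            idx += 1
    return cols, h_out, w_out

def conv2d_im2col(x, weights, stride=1):
    # weights: (M, C, k, k); returns the (M, H_out, W_out) output.
    m, c, k, _ = weights.shape
    cols, h_out, w_out = im2col(x, k, stride)
    # The whole convolution is now one matrix multiplication.
    out = weights.reshape(m, c * k * k) @ cols
    return out.reshape(m, h_out, w_out)
```

The duplication inside `im2col` (overlapping patches copied into separate columns) is the storage cost that blocked schemes such as MEC and the MTCA-based design in this paper try to reduce.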
