Abstract

Convolutional Neural Networks (CNNs) have been widely applied in various fields, such as image recognition, speech processing, and many big-data analysis tasks. However, their large model size and intensive computation hinder their deployment in hardware, especially on embedded systems with stringent latency, power, and area requirements. To address this issue, low bit-width CNNs have been proposed as a highly competitive candidate. In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. With a novel coarse grain task partitioning (CGTP) strategy, the proposed accelerator, built from heterogeneous computing units supporting multi-pattern dataflows, can nearly double the throughput for various CNN models on average. Besides, a hardware-friendly algorithm is proposed to simplify the activation and quantization process, which reduces power dissipation and area overhead. Based on the optimized algorithm, an efficient reconfigurable three-stage activation-quantization-pooling (AQP) unit with a low-power staged blocking strategy is developed, which can process activation, quantization, and max-pooling operations simultaneously. Moreover, an interleaving memory scheduling scheme is proposed to support the streaming architecture. The accelerator is implemented in TSMC 40 nm technology with a core size of 0.17 mm². It achieves 7.03 TOPS/W energy efficiency and 4.14 TOPS/mm² area efficiency at 100.1 mW, which makes it a promising design for embedded devices.
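To make the three-stage AQP flow concrete, below is a minimal software sketch of a fused activation-quantization-pooling pass, assuming a ReLU activation, a uniform k-bit quantizer with step size `scale`, and a 2×2 max-pooling window; the function name and all parameters are illustrative assumptions, not the paper's RTL design.

```python
import numpy as np

def aqp_fused(conv_out, k=2, scale=1.0, pool=2):
    """Fused activation -> quantization -> max-pooling pass.

    Sketches the AQP idea in software: each convolution output is
    activated and quantized exactly once while the pooling window is
    reduced, rather than in three separate full passes over the map.
    Assumes the spatial dims are divisible by the pooling size.
    """
    h, w = conv_out.shape
    levels = (1 << k) - 1                                    # largest k-bit code
    out = np.empty((h // pool, w // pool), dtype=np.int32)
    for i in range(0, h, pool):
        for j in range(0, w, pool):
            best = 0
            for di in range(pool):
                for dj in range(pool):
                    a = max(float(conv_out[i + di, j + dj]), 0.0)  # activation (ReLU)
                    q = min(int(round(a / scale)), levels)         # k-bit quantization
                    best = max(best, q)                            # max-pooling
            out[i // pool, j // pool] = best
    return out

x = np.random.randn(4, 4).astype(np.float32)
print(aqp_fused(x, k=2, scale=0.5))
```

Because ReLU and a uniform quantizer are both monotonic non-decreasing, they commute with the max reduction, which is what allows a staged hardware unit to process all three operations in a single pass over each value.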

Highlights

  • We propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture, accompanied by several optimization strategies aimed at reducing power consumption and area overhead and improving throughput

  • We propose a novel coarse grain task partitioning (CGTP) strategy to minimize the processing time of each computing unit in the parallel streaming architecture, which improves throughput. Besides, multi-pattern dataflows are designed for CNN layers of different sizes (a software sketch of the partitioning idea follows this list)
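The sketch below illustrates one way a CGTP-style split can be computed for two pipelined computing units: pick the layer boundary that minimizes the slower pipeline stage, since that stage bounds overall throughput. The one-cost-per-layer model and all names are hypothetical assumptions, not the paper's actual procedure.

```python
def cgtp_split(layer_costs):
    """Coarse-grain task partitioning over two pipelined computing units.

    layer_costs: estimated cycles for each CNN layer, in execution order.
    Returns the split point that balances the two pipeline stages,
    i.e. minimizes the slower (throughput-limiting) stage.
    """
    total = sum(layer_costs)
    best_split, best_worst = 1, float("inf")
    prefix = 0
    for split in range(1, len(layer_costs)):
        prefix += layer_costs[split - 1]
        worst = max(prefix, total - prefix)   # pipeline throughput = 1 / worst
        if worst < best_worst:
            best_split, best_worst = split, worst
    return best_split, best_worst

# Example: front layers on one unit, back layers on the other.
costs = [120, 90, 60, 45, 30, 25]             # hypothetical per-layer cycle counts
split, bottleneck = cgtp_split(costs)
print(f"layers[:{split}] -> unit A, layers[{split}:] -> unit B, stage time {bottleneck}")
```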


Introduction

Convolutional neural networks (CNNs) have been widely applied in a variety of domains [1,2,3], and achieve great performance in many tasks including image recognition [4], speech processing [5] and natural language processing [6]. However, their large model size and intensive computation stimulate research on both algorithms and hardware designs that pursue low power and high throughput. On the algorithm side, one approach is to compress the model by pruning redundant connections, resulting in a sparse neural network [8]; this, however, incurs supplementary overheads, including the pruning itself, Huffman coding, and decoding. On the hardware side, a CNN consists mainly of convolutional (CONV) layers and fully connected (FCN) layers. The CONV layers bear the most intensive computation: they apply filters to the input feature maps to extract embedded visual characteristics and generate the output feature maps. In FCN layers, the features are represented as a vector, and each node of this vector is connected to all nodes of the previous layer, which explains the huge number of parameters and the high bandwidth requirement (see the arithmetic sketch below).
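A short arithmetic sketch makes this asymmetry explicit: CONV layers dominate multiply-accumulate (MAC) operations, while FCN layers dominate parameter storage and hence bandwidth. The layer shapes below are hypothetical, loosely VGG-like examples, not taken from the paper.

```python
def conv_cost(h, w, cin, cout, k):
    """MACs and weights for one CONV layer (stride 1, 'same' padding)."""
    macs = h * w * cout * cin * k * k      # every output pixel needs cin*k*k MACs
    weights = cout * cin * k * k           # filters are reused across all pixels
    return macs, weights

def fc_cost(n_in, n_out):
    """MACs and weights for one fully connected (FCN) layer."""
    return n_in * n_out, n_in * n_out      # one weight per MAC, no reuse

# Hypothetical layer shapes:
conv_macs, conv_w = conv_cost(56, 56, 128, 128, 3)
fc_macs, fc_w = fc_cost(4096, 4096)
print(f"CONV: {conv_macs/1e6:.0f} M MACs, {conv_w/1e6:.2f} M weights")
print(f"FCN : {fc_macs/1e6:.0f} M MACs, {fc_w/1e6:.2f} M weights")
```

With these shapes the CONV layer needs roughly 462 M MACs but only about 0.15 M weights, whereas the FCN layer needs about 17 M MACs yet stores about 16.8 M weights, which is why CONV layers stress compute and FCN layers stress memory bandwidth.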
