Abstract

As a key ingredient of deep neural networks (DNNs), fully-connected (FC) layers are widely used in various artificial intelligence applications. However, FC layers contain many parameters, so their efficient processing is restricted by memory bandwidth. In this paper, we propose a compression approach combining block-circulant matrix-based weight representation and power-of-two quantization. Applying block-circulant matrices in FC layers reduces the storage complexity from O(k^2) to O(k). By quantizing the weights into integer powers of two, the multiplications during inference can be replaced by shift and add operations. The memory usage of models for MNIST, CIFAR-10 and ImageNet can be compressed by 171×, 2731× and 128×, respectively, with minimal accuracy loss. A configurable parallel hardware architecture is then proposed for processing the compressed FC layers efficiently. Without multipliers, a block matrix-vector multiplication (B-MV) module is used as the computing kernel. The architecture is flexible enough to support FC layers with various compression ratios at a small footprint. At the same time, the configurable architecture significantly reduces memory accesses. Measurement results show that the accelerator delivers a processing power of 409.6 GOPS and achieves 5.3 TOPS/W energy efficiency at 800 MHz.
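The two compression ideas in the abstract can be illustrated with a minimal NumPy sketch. The function names, the 4×4 block size, and the toy weight values below are illustrative assumptions, not the paper's implementation: a circulant block is fully determined by its first column (O(k) storage instead of O(k^2)), and rounding each weight to the nearest signed power of two lets hardware replace each multiplication with a shift.

```python
import numpy as np

def circulant(c):
    """Build a k x k circulant matrix from its defining vector c (O(k) storage)."""
    k = len(c)
    return np.array([[c[(i - j) % k] for j in range(k)] for i in range(k)])

def quantize_pow2(w):
    """Round each nonzero weight to the nearest signed integer power of two."""
    sign = np.sign(w)
    mag = np.abs(w)
    exp = np.round(np.log2(np.where(mag > 0, mag, 1.0)))
    return sign * np.where(mag > 0, 2.0 ** exp, 0.0)

# Toy 4x4 block: only the 4-element defining vector is stored.
c = np.array([0.9, -0.26, 0.51, 0.12])
W = circulant(quantize_pow2(c))     # weights are powers of two: 1, -0.25, 0.5, 0.125
x = np.array([1.0, 2.0, -1.0, 0.5])
y = W @ x  # in hardware, each product becomes a shift-and-add, no multiplier needed
```

In a real block-circulant FC layer the weight matrix is tiled into many such k×k circulant blocks, so the compression ratio scales with the block size k.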

Highlights

  • Deep neural networks (DNNs) have been widely applied to various artificial intelligence (AI) applications [1,2,3,4] and achieve great performance in many tasks such as image recognition [5,6,7], speech recognition [8] and object detection [9].

  • The computation in an FC layer is performed as y = f(Wx + v), where x is the input activation vector; y is the output activation vector; W ∈ R^(a×b) is the weight matrix of this layer, in which b input nodes connect with a output nodes; v ∈ R^a is the bias vector; and f(·) is the nonlinear activation function, for which the Rectified Linear Unit (ReLU) is widely adopted in various DNN models.

  • The RTL of the design was implemented in Verilog and synthesized using the Synopsys Design Compiler (DC).


Summary

Introduction

Deep neural networks (DNNs) have been widely applied to various artificial intelligence (AI) applications [1,2,3,4] and achieve great performance in many tasks such as image recognition [5,6,7], speech recognition [8] and object detection [9]. To complete these tasks with higher accuracy, larger and more complex DNN models emerge and become increasingly popular. Fully-connected (FC) layers are applied in various deep learning systems. The computation in an FC layer is performed as y = f(Wx + v), where x is the input activation vector; y is the output activation vector; W ∈ R^(a×b) is the weight matrix of this layer, in which b input nodes connect with a output nodes; v ∈ R^a is the bias vector; and f(·) is the nonlinear activation function, for which the Rectified Linear Unit (ReLU) is widely adopted in various DNN models. For many neural network architectures, memory access is the bottleneck to processing FC layers efficiently.
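The FC-layer computation above can be sketched directly in NumPy. The sizes and weight values here are made up for illustration; the function mirrors y = f(Wx + v) with f as ReLU:

```python
import numpy as np

def fc_layer(W, x, v):
    """FC layer forward pass: y = ReLU(W x + v).

    W has shape (a, b): b input nodes fully connected to a output nodes.
    x has shape (b,), v has shape (a,).
    """
    z = W @ x + v           # affine transform Wx + v
    return np.maximum(z, 0.0)  # ReLU activation f(.)

# Toy layer: b = 3 inputs, a = 2 outputs
W = np.array([[0.5, -1.0,  2.0],
              [1.0,  0.0, -0.5]])
x = np.array([1.0, 2.0, 0.5])
v = np.array([0.1, -0.2])
y = fc_layer(W, x, v)  # shape (2,)
```

Every output node reads all b weights of its row, which is why W dominates both the storage and the memory traffic of an FC layer.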

Methods
Results
Discussion
Conclusion
