Abstract

This paper presents a multiply-accumulate (MAC) unit that enables a dual-mode truncation error compensation (TEC) scheme based on a fixed-width Booth multiplier (FWBM) for convolutional neural network (CNN) inference operations. The two tailored TEC schemes achieve high MAC accuracy: Mode 1 targets general input patterns, while Mode 2 targets the positive/zero input patterns produced by rectified linear unit (ReLU)-based CNN models. By precomputing the known CNN model coefficients, the proposed dual-mode TEC scheme can be realized with minimal partial product operations and high hardware efficiency through a software-hardware co-design approach. Further, a reconfigurable architecture of the resultant MAC unit is presented to realize the proposed dual-mode TEC scheme. In accuracy evaluations of 9-$N$ and 25-$N$ MAC operations ($N$ denotes the number of times MAC is performed), the proposed TEC scheme achieves the highest accuracy in both modes, relative to contrast designs that directly employ an FWBM with a conventional TEC function. The hardware performance of the 9-$N$ and 25-$N$ MAC units is also evaluated using the TSMC 40-nm standard cell library. Compared with the contrast TEC-enabled designs, the proposed MAC unit exhibits higher hardware efficiency in terms of area, delay, and power consumption, achieving reductions of more than 40% in both the area-delay-error and power-delay-error products. Moreover, the resultant 9-$N$ and 25-$N$ MAC units are verified on a system-on-chip field-programmable gate array platform by testing a CNN model for handwritten digit classification.
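The core idea behind fixed-width multiplication with TEC can be illustrated with a short sketch. This is a minimal illustration only: the function names are mine, and the constant half-LSB bias shown here is the simplest possible compensation term, not the paper's adaptive dual-mode scheme. An $n$-bit by $n$-bit product is $2n$ bits wide; a fixed-width multiplier keeps only the upper $n$ bits, and a TEC term estimates the carry lost from the truncated lower part.

```python
import random

def truncated_product(a, b, n):
    # fixed-width result: keep only the n most-significant bits of the 2n-bit product
    return (a * b) >> n

def tec_product(a, b, n):
    # simplest TEC form (illustrative): add a constant half-LSB bias before truncating,
    # which centers the truncation error around zero instead of leaving it one-sided
    return ((a * b) + (1 << (n - 1))) >> n

# compare average absolute error of the two schemes over random unsigned inputs
random.seed(0)
n = 8
pairs = [(random.randrange(1 << n), random.randrange(1 << n)) for _ in range(10000)]
err_plain = sum(abs(a * b - (truncated_product(a, b, n) << n)) for a, b in pairs) / len(pairs)
err_tec = sum(abs(a * b - (tec_product(a, b, n) << n)) for a, b in pairs) / len(pairs)
```

Plain truncation always rounds down, so its error is one-sided; the biased version rounds to nearest, roughly halving the average error. The paper's contribution lies in replacing this fixed bias with mode-dependent estimates derived from the input statistics.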

Highlights

  • A convolutional neural network (CNN) is a popular group of deep learning models that has demonstrated strong performance in many applications, such as image and signal processing, pattern recognition, and computer vision

  • Considerable research has been conducted on hardware (HW) acceleration schemes using application-specific integrated circuits or field-programmable gate array (FPGA) units, which can perform efficient client-side CNN inference computations [3]–[8]

  • Table: mapping results for the Booth encoder and partial products


Summary

INTRODUCTION

A convolutional neural network (CNN) is a popular group of deep learning models that has demonstrated strong performance in many applications, such as image and signal processing, pattern recognition, and computer vision. To reduce hardware (HW) costs for the operations associated with the truncated $L$-bit least-significant bits of a full-width product, the concept of truncation error compensation (TEC) has been proposed for fixed-width multiplier designs [17]–[32]. We aim to design a fixed-width Booth multiplier (FWBM)-based MAC unit for CNN inference operations and propose the corresponding TEC schemes. The Mode 1 TEC scheme provides an ensemble high-accuracy TEC function that improves overall MAC accuracy for general-valued input patterns. The Mode 2 TEC scheme provides a high-accuracy TEC function adapted to positive or zero input patterns, as arise in the MAC operations of CNNs using the rectified linear unit (ReLU) activation function. A design extension based on the proposed Mode 2 TEC scheme supports MAC unit designs for general edge applications that are not limited to CNN operations.
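The distinction between the two modes can be made concrete with a toy MAC loop. This is a minimal sketch with hypothetical names; the real unit performs this computation in fixed-width hardware. After a ReLU layer, every activation entering the next layer's MAC is nonnegative, and this restricted input statistic is what Mode 2 exploits, whereas Mode 1 assumes no such restriction.

```python
def relu(x):
    # rectified linear unit: clamps negative values to zero
    return x if x > 0 else 0

def mac(activations, weights):
    # multiply-accumulate: the inner loop of a convolution or fully connected layer
    acc = 0
    for a, w in zip(activations, weights):
        acc += a * w
    return acc

prev_layer_outputs = [-3, 5, 0, -1, 2]
activations = [relu(v) for v in prev_layer_outputs]  # all entries >= 0: the Mode 2 pattern
weights = [1, -2, 3, 4, -5]                          # weights remain general-valued
result = mac(activations, weights)
```

Because one multiplicand is guaranteed nonnegative after ReLU, a Mode 2 compensation function can be tuned to that one-sided operand distribution, whereas general-valued inputs (Mode 1) require an ensemble estimate over both signs.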

BACKGROUND
CONCLUSION