Abstract

The choice of data type has a major impact on the speed, accuracy, and power consumption of deep learning accelerators. Quantizing the weights and activations of neural networks to integer-based computation is an industry standard for reducing the memory footprint and computation cost of inference in embedded systems. Uniform weight quantization can be used for tasks where some drop in accuracy can be tolerated. However, the accuracy drop due to uniform quantization can be non-negligible, especially for shallow networks, complex computer vision tasks, or lower-bit integers.

In this paper, we introduce a software and a hardware solution that improve on a baseline integer-based uniform quantization so that it can run on lower-power systems and with even fewer bits. We also introduce a novel encoding technique, built on top of our software solution and specific to partial sums, that significantly reduces the memory footprint, latency, and energy consumption caused by partial-sum movement. The proposed SW solution exploits non-uniform piece-wise linear quantization to improve accuracy by capturing the bell-shaped distribution of weights while still using INT-based computation units. The proposed partial sum encoding can be applied to partial sums regardless of whether quantization is uniform or non-uniform. The proposed HW solution can either combine integers into larger integers or turn them into floating-point operations, so that different levels can use different precisions or data types if necessary. To do so, we studied the upper limits of precision needed in our compute units to support floating-point inner product operations. It turns out that we can extend integer IPUs to perform accurate floating-point operations without introducing large shift units or wide adder trees.

Our proposed SW solution (PWLQ) achieves state-of-the-art results in all cases and outperforms all other methods by a large margin. The proposed partial sum encoding technique effectively compresses the partial sums of networks like ResNet-50 down to 12 bits (from INT64/INT32) without loss in accuracy. The proposed HW architecture achieves area-efficiency improvements of up to 46% in TOPS/mm² and power-efficiency improvements of up to 63% in TOPS/W compared to a state-of-the-art mixed-precision implementation.
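To give a sense of the general idea behind piece-wise linear quantization, the sketch below splits a symmetric weight range into a dense center region and sparser tail regions, each quantized uniformly with its own scale. This is only an illustrative NumPy sketch under assumed choices (the function name pwlq_quantize, a fixed breakpoint_frac, and equal bit allocation per region are hypothetical); it is not the paper's exact PWLQ algorithm or breakpoint-selection method.

    import numpy as np

    def pwlq_quantize(w, num_bits=4, breakpoint_frac=0.5):
        # Illustrative piece-wise linear quantization sketch (hypothetical
        # parameters), not the paper's algorithm. The bell-shaped center of
        # the weight distribution gets a finer step than the tails.
        m = np.abs(w).max()
        b = breakpoint_frac * m              # assumed breakpoint choice
        levels = 2 ** (num_bits - 1) - 1     # signed levels per region

        center_mask = np.abs(w) <= b
        scale_center = b / levels
        scale_tail = (m - b) / levels

        q = np.empty_like(w)
        # Center region: uniform quantization with the finer scale.
        q[center_mask] = np.round(w[center_mask] / scale_center) * scale_center
        # Tail regions: offset by the breakpoint, then quantize uniformly.
        tails = w[~center_mask]
        q[~center_mask] = np.sign(tails) * (
            b + np.round((np.abs(tails) - b) / scale_tail) * scale_tail
        )
        return q

Because each region is itself uniform, the per-region quantized values can still be produced and consumed by integer compute units, which is the property the abstract highlights.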
