Abstract

Over the past decade, Deep Neural Networks (DNNs) trained with Deep Learning (DL) frameworks have become the workhorse for a wide variety of computational tasks in big-data environments. To date, DNNs have relied on large amounts of computational power to reach peak performance, typically exploiting the high computational bandwidth of GPUs while straining available memory bandwidth and capacity. With ever-increasing data complexity and more stringent energy constraints in Internet-of-Things (IoT) application environments, there has been growing interest in developing more efficient DNN inference methods that economize on random-access memory usage for weight access. Here, we present a systematic analysis of the performance trade-offs of quantized weight representations at variable bit lengths for memory-efficient inference in pre-trained DNN models. We vary the mantissa and exponent bit lengths in the representation of the network parameters, and we examine the effect of Dropout regularization during pre-training and the impact of two weight truncation mechanisms: stochastic and deterministic rounding. We show a drastic reduction in memory requirements, down to 4 bits per weight, while maintaining near-optimal test performance of low-complexity DNNs pre-trained on the MNIST and CIFAR-10 datasets. These results offer a simple methodology for achieving high memory and computation efficiency of inference in low-power, DNN-dedicated hardware for IoT, directly from high-resolution DNNs pre-trained with standard DL algorithms.
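To illustrate the kind of post-training weight quantization the abstract describes, the sketch below shows a reduced-precision representation with a configurable exponent/mantissa split and either deterministic (round-to-nearest) or stochastic rounding. This is a minimal illustration, not the authors' released code; the function name `quantize_weights`, its parameters, and the exponent-range convention are assumptions made here for clarity.

```python
# Minimal sketch (illustrative assumption, not the paper's implementation):
# quantize pre-trained weights to a low-precision float-like format with
# exp_bits exponent bits and man_bits mantissa (fraction) bits, using either
# deterministic (round-to-nearest) or stochastic rounding.
import numpy as np

def quantize_weights(w, exp_bits=3, man_bits=4, mode="deterministic", rng=None):
    rng = rng or np.random.default_rng()
    sign = np.sign(w)
    mag = np.abs(w)

    # Representable exponent range, e.g. 3 exponent bits -> [-4, 3]
    max_exp = 2 ** (exp_bits - 1) - 1
    min_exp = -(2 ** (exp_bits - 1))

    # Per-weight exponent, clipped to the representable range
    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    exp = np.clip(exp, min_exp, max_exp)

    # Express each magnitude in units of its quantization step
    step = 2.0 ** (exp - man_bits)
    frac = mag / step

    if mode == "stochastic":
        # Round up with probability equal to the fractional remainder
        frac_q = np.floor(frac + rng.random(frac.shape))
    else:
        # Deterministic round-to-nearest
        frac_q = np.round(frac)

    q = sign * frac_q * step
    # Clamp to the largest representable magnitude
    max_val = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** max_exp
    return np.clip(q, -max_val, max_val)

# Usage example: quantize a pre-trained weight vector two ways and compare
w = np.random.randn(5).astype(np.float32)
print(w)
print(quantize_weights(w, exp_bits=3, man_bits=4, mode="deterministic"))
print(quantize_weights(w, exp_bits=3, man_bits=4, mode="stochastic"))
```

Under this scheme the total per-weight budget is 1 sign bit plus `exp_bits` plus `man_bits`; stochastic rounding preserves weight values in expectation, which is one reason it is often compared against deterministic rounding in quantized inference studies.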
