Abstract

Deep neural networks (DNNs) usually have many layers and thousands of trainable parameters to ensure high accuracy. Because of their heavy computation and memory requirements, such networks are ill-suited to real-time, resource-constrained mobile or embedded systems. Various techniques, such as network pruning, weight sharing, network quantization, and weight encoding, have been proposed to improve computational and memory efficiency. This paper presents a synchronous weight quantization-compression (SWQC) technique that compresses the weights of low-bit quantized neural networks (QNNs). Specifically, weights are quantized not strictly according to their values but also according to their compression efficiency and the probability of each weight mapping to a different quantized result; compression efficiency is thus treated as a first-class factor during quantization itself. With the help of retraining, a high compression rate and high accuracy can be achieved simultaneously. The technique is verified on 4-bit QNNs using the MNIST and CIFAR10 datasets. Results show no loss of classification accuracy at compression rates of 5.4X and 4.4X for the two datasets, respectively. The compression rate on MNIST increases to 12.1X with a 1% accuracy drop, while CIFAR10 reaches a compression rate of 5.6X with an accuracy drop of about 0.6%.
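The abstract does not spell out the SWQC algorithm, but the core idea it describes can be sketched: when a weight lies near the boundary between two quantization levels (i.e., it has a non-trivial probability of mapping to either quantized result), pick the level that compresses better rather than the strictly nearest one. Below is a minimal, illustrative Python sketch of that idea, assuming a uniform 4-bit grid, a run-length-style compressor (so repeating the previous code is what "compresses better"), and a hypothetical `flex` threshold for deciding which weights are flexible; none of these specifics are given in the abstract, and retraining to recover accuracy is assumed to follow.

```python
import numpy as np

def swqc_quantize(weights, n_bits=4, flex=0.2):
    """Compression-aware uniform quantization (illustrative sketch).

    Each weight normally maps to its nearest quantization level, but a
    weight within `flex` of the midpoint between two levels may take the
    second-nearest level instead when that choice repeats the previous
    code, lengthening runs for a run-length-style compressor.
    """
    levels = 2 ** n_bits
    w_min, w_max = float(weights.min()), float(weights.max())
    step = (w_max - w_min) / (levels - 1)

    flat = weights.ravel()
    codes = np.empty(flat.size, dtype=np.int32)
    prev = -1  # no previous code yet
    for i, w in enumerate(flat):
        exact = (w - w_min) / step        # fractional level index
        nearest = int(round(exact))
        cands = [nearest]
        # Flexible weight: close enough to the midpoint that the
        # second-nearest level is also an acceptable quantized result.
        if abs(exact - nearest) > (0.5 - flex):
            second = nearest - 1 if nearest > exact else nearest + 1
            if 0 <= second < levels:
                cands.append(second)
        # Prefer the candidate that repeats the previous code.
        code = prev if prev in cands else nearest
        codes[i] = code
        prev = code
    return codes.reshape(weights.shape), w_min, step

# Usage: quantize, then dequantize for the retraining pass.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, w_min, step = swqc_quantize(w)
w_hat = w_min + q * step  # dequantized weights fed back into retraining
```

The `flex` parameter is the knob the abstract hints at: larger values let more weights deviate from their nearest level, raising the compression rate at the cost of more quantization error, which retraining then has to absorb.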
