Abstract

Deep neural networks (DNNs) usually have large numbers of trainable parameters to ensure high accuracy. Due to their large computation and memory requirements, these networks are not suitable for real-time and resource-constrained systems. Various techniques such as network pruning, weight sharing, network quantization, and weight encoding have been used to improve computational and memory efficiency. The synchronous weight quantization-compression (SWQC) technique applies both network quantization and weight encoding to compress weights during the quantization process. It generates a quantized neural network (QNN) model with a good trade-off between accuracy and compression rate by choosing a proper group size, number of retraining epochs, and weight threshold. To further improve the compression rate of SWQC, this paper proposes a new weight-encoding strategy, unbalanced encoding, which can compress one or multiple quantized weights into a single bit and thereby achieve a higher compression rate. Experiments are performed on a 4-bit QNN using the CIFAR10 dataset. The results show that unbalanced encoding achieves a higher compression rate for layers with large numbers of parameters, and that mixed encoding, which combines balanced and unbalanced encoding across different layers, achieves a higher compression rate than using either one alone. In the CIFAR10 experiments, unbalanced encoding reaches a compression rate of over 13X in the fully connected layer, and SWQC with unbalanced encoding achieves a compression rate more than 5X higher than with balanced encoding only.
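The abstract does not spell out how unbalanced encoding maps groups of quantized weights to bits, so the sketch below is only a hypothetical illustration of the general idea, not the authors' scheme: a group whose weights all equal an assumed dominant quantized value (taken here to be 0) collapses to a single flag bit, while any other group is stored as a flag bit plus its raw 4-bit weights. The function names, group size, and dominant value are all assumptions made for this example.

```python
# Hypothetical sketch of a skewed ("unbalanced") group encoding; NOT the
# paper's exact method. Assumption: after quantization many weights share a
# dominant value (0 here), so an all-dominant group is emitted as one bit.

def encode_unbalanced(weights, group_size=4, bits=4, dominant=0):
    """Encode a flat list of non-negative quantized weights into a bit string."""
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        if all(w == dominant for w in group):
            out.append("0")                      # whole group collapses to one bit
        else:
            out.append("1")                      # flag bit, then raw weight bits
            out.extend(format(w, f"0{bits}b") for w in group)
    return "".join(out)

def decode_unbalanced(bitstream, n_weights, group_size=4, bits=4, dominant=0):
    """Inverse of encode_unbalanced for a known number of weights."""
    weights, pos = [], 0
    while len(weights) < n_weights:
        flag, pos = bitstream[pos], pos + 1
        remaining = min(group_size, n_weights - len(weights))
        if flag == "0":
            weights.extend([dominant] * remaining)
        else:
            for _ in range(remaining):
                weights.append(int(bitstream[pos:pos + bits], 2))
                pos += bits
    return weights

if __name__ == "__main__":
    w = [0, 0, 0, 0, 3, 0, 1, 0, 0, 0, 0, 0]    # toy 4-bit quantized weights
    enc = encode_unbalanced(w)
    assert decode_unbalanced(enc, len(w)) == w
    print(f"{len(w) * 4} raw bits -> {len(enc)} encoded bits")
```

Under this kind of skewed scheme the achievable compression grows with the fraction of all-dominant groups, which is consistent with the abstract's observation that the gains are largest in layers with many parameters, such as the fully connected layer.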
