Abstract

Deep neural networks (DNNs) usually have many layers and thousands of trainable parameters to ensure high accuracy. Because of their heavy computation and memory requirements, such networks are ill-suited to real-time, resource-constrained mobile or embedded systems. Various techniques, such as network pruning, weight sharing, network quantization, and weight encoding, have been proposed to improve computational and memory efficiency. This paper presents a synchronous weight quantization-compression (SWQC) technique that compresses the weights of low-bit quantized neural networks (QNNs). Specifically, weights are quantized not strictly according to their values but also according to their compression efficiency and the probability of each weight mapping to a different quantized result; compression efficiency is thus treated as a first-class factor during quantization itself. With the help of retraining, a high compression rate and high accuracy can be achieved simultaneously. The technique is verified on 4-bit QNNs using the MNIST and CIFAR10 datasets. Results show no loss of classification accuracy at compression rates of 5.4X and 4.4X for the two datasets, respectively. The compression rate on MNIST increases to 12.1X with a 1% accuracy drop, while CIFAR10 reaches a compression rate of 5.6X with an accuracy drop of about 0.6%.
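The abstract does not spell out the SWQC algorithm, but the core idea it describes can be sketched: when a weight lies near the boundary between two quantization levels (i.e., it has a non-trivial probability of mapping to either quantized result), pick the level that compresses better rather than the strictly nearest one. Below is a minimal, illustrative Python sketch of that idea, assuming a uniform 4-bit grid, a run-length-style compressor (so repeating the previous code is what "compresses better"), and a hypothetical `flex` threshold for deciding which weights are flexible; none of these specifics are given in the abstract, and retraining to recover accuracy is assumed to follow.

```python
import numpy as np

def swqc_quantize(weights, n_bits=4, flex=0.2):
    """Compression-aware uniform quantization (illustrative sketch).

    Each weight normally maps to its nearest quantization level, but a
    weight within `flex` of the midpoint between two levels may take the
    second-nearest level instead when that choice repeats the previous
    code, lengthening runs for a run-length-style compressor.
    """
    levels = 2 ** n_bits
    w_min, w_max = float(weights.min()), float(weights.max())
    step = (w_max - w_min) / (levels - 1)

    flat = weights.ravel()
    codes = np.empty(flat.size, dtype=np.int32)
    prev = -1  # no previous code yet
    for i, w in enumerate(flat):
        exact = (w - w_min) / step        # fractional level index
        nearest = int(round(exact))
        cands = [nearest]
        # Flexible weight: close enough to the midpoint that the
        # second-nearest level is also an acceptable quantized result.
        if abs(exact - nearest) > (0.5 - flex):
            second = nearest - 1 if nearest > exact else nearest + 1
            if 0 <= second < levels:
                cands.append(second)
        # Prefer the candidate that repeats the previous code.
        code = prev if prev in cands else nearest
        codes[i] = code
        prev = code
    return codes.reshape(weights.shape), w_min, step

# Usage: quantize, then dequantize for the retraining pass.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, w_min, step = swqc_quantize(w)
w_hat = w_min + q * step  # dequantized weights fed back into retraining
```

The `flex` parameter is the knob the abstract hints at: larger values let more weights deviate from their nearest level, raising the compression rate at the cost of more quantization error, which retraining then has to absorb.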
