Abstract

The large number of parameters and the intensive computation of neural networks limit their deployment on embedded devices with scarce storage and computing power. To address this problem, a novel quantization algorithm combined with Knowledge Distillation (KD) is proposed to reduce model size and speed up the inference of deep models. The proposed method consists of two phases: KD-Training and Quantization-Retraining. KD-Training trains a compact student model with pre-quantized weights using the proposed pre-quantized constraint loss. In Quantization-Retraining, the pre-quantized weights are quantized to powers of two (±2^n), and the first and last layers of the network are retrained to make up for the accuracy loss caused by quantization. Experiments on the CIFAR-10 dataset show that the proposed method can obtain a low-precision (2-5 bit) quantized student model with a compact structure whose test accuracy even exceeds that of its full-precision (32-bit) reference, owing to improved generalization. It also achieves higher performance than other quantization methods. Moreover, since the quantized weights are constrained to {±2^n}, the method is well suited to accelerating network computation in hardware, where multiplications by powers of two reduce to bit shifts.
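The abstract does not include code, but the power-of-two constraint can be illustrated with a minimal sketch. The snippet below is our own assumption of how weights might be snapped to the nearest value in {±2^n}, not the authors' implementation: the function name quantize_pow2, the exponent range derived from the bit width, and the handling of near-zero weights are all illustrative choices.

import numpy as np

def quantize_pow2(weights, bits=4):
    # Snap each weight to the nearest signed power of two.
    # Illustrative sketch only: the exponent window implied by `bits`
    # and the treatment of tiny weights are assumptions, not details
    # taken from the paper.
    w = np.asarray(weights, dtype=np.float32)
    sign = np.sign(w)
    mag = np.abs(w)

    # Avoid log2(0) by flooring magnitudes at the smallest positive float.
    eps = np.finfo(np.float32).tiny
    exp = np.round(np.log2(np.maximum(mag, eps)))

    # Clip exponents to a small window so the codebook stays compact
    # (an assumed layout, not necessarily the paper's exact scheme).
    max_exp = 0                                  # largest allowed level: 2^0 = 1
    min_exp = max_exp - (2 ** (bits - 1) - 1)
    exp = np.clip(exp, min_exp, max_exp)

    return sign * (2.0 ** exp)

# Example: a handful of float weights snapped to {±2^n}
w = np.array([0.37, -0.06, 0.9, -0.51])
print(quantize_pow2(w, bits=4))   # -> [ 0.5    -0.0625  1.     -0.5  ]

Because every quantized weight is a signed power of two, multiplying an activation by it can be realized as a bit shift, which is the hardware-acceleration property noted at the end of the abstract.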
