Abstract

As the complexity of the problems being processed increases, deep neural networks require more computing and storage resources. At the same time, researchers have found that deep neural networks contain considerable redundancy, which wastes resources and leaves room for further optimization of the network model. Motivated by these observations, researchers have in recent years turned their attention to building more compact and efficient models, so that deep neural networks can be deployed on resource-constrained nodes to enhance their intelligence. Current compression methods for deep neural network models include weight pruning, weight quantization, and knowledge distillation. These three methods have distinct characteristics, can each be applied independently, and can be combined for further optimization. This paper constructs a deep neural network model compression framework based on weight pruning, weight quantization, and knowledge distillation. First, the model undergoes double coarse-grained compression through pruning and quantization; then the original network is used as the teacher network to guide training of the compressed student network, improving the student network's accuracy and thereby further accelerating and compressing the model while keeping the loss of accuracy small. Experimental results show that the combination of the three algorithms can reduce FLOPs by 80% while decreasing accuracy by only 1%.
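The pipeline described in the abstract (pruning and weight quantization followed by distillation from the uncompressed teacher) can be outlined roughly as in the sketch below. This is a minimal PyTorch-style sketch under assumptions not stated in the abstract: unstructured L1 magnitude pruning, symmetric per-tensor 8-bit fake quantization, and the standard temperature-scaled knowledge distillation loss. The function names, sparsity level, temperature, and loss weighting are illustrative and not the paper's exact implementation.

```python
# Sketch: prune + quantize a copy of a trained network, then fine-tune the
# compressed "student" under the guidance of the original "teacher".
# All hyperparameters here are illustrative assumptions, not the paper's values.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_masks(model, sparsity=0.5):
    """Magnitude pruning: mask out the smallest-|w| weights in each conv/linear layer."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            w = module.weight.data.abs()
            k = int(sparsity * w.numel())
            if k > 0:
                threshold = w.flatten().kthvalue(k).values
                masks[name] = (w > threshold).float()
            else:
                masks[name] = torch.ones_like(w)
    return masks

def apply_compression(model, masks, bits=8):
    """Zero out pruned weights and fake-quantize the survivors to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    for name, module in model.named_modules():
        if name in masks:
            w = module.weight.data * masks[name]
            scale = w.abs().max() / qmax
            if scale > 0:
                w = torch.round(w / scale).clamp(-qmax, qmax) * scale
            module.weight.data = w

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target KL term from the teacher plus hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def compress_and_distill(teacher, loader, sparsity=0.5, bits=8, epochs=10, lr=1e-3):
    """Prune and quantize a copy of the teacher, then fine-tune it with distillation."""
    student = copy.deepcopy(teacher)
    masks = build_masks(student, sparsity)
    apply_compression(student, masks, bits)
    teacher.eval()
    optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for inputs, labels in loader:
            with torch.no_grad():
                teacher_logits = teacher(inputs)
            loss = distillation_loss(student(inputs), teacher_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Re-impose sparsity and quantization after each update so the
            # student stays compressed while distillation recovers accuracy.
            apply_compression(student, masks, bits)
    return student
```

Re-applying the mask and quantizer after each optimizer step keeps the student compressed throughout distillation while the teacher's soft targets recover accuracy; the paper's actual training schedule and quantization scheme may differ.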
