Abstract

Knowledge distillation is a model compression technique that can effectively improve the performance of a small student network by transferring knowledge from a large pre-trained teacher network. In most previous feature distillation works, the student's performance remains below the teacher's because the student is supervised only by the teacher's features and the labels. In this paper, we propose Cross-layer Fusion for Knowledge Distillation (CFKD). Specifically, instead of using only the features of the teacher network, we aggregate the features of the teacher and student networks with a dynamic feature fusion strategy (DFFS) and a fusion module. The fused features are informative: they contain not only the expressive knowledge of the teacher network but also the useful knowledge already learned by the student network. Therefore, a student network that learns from the fused features can achieve performance comparable to the teacher network. Our experiments demonstrate that a student trained with our method can approach, and in some cases surpass, the performance of the teacher network.
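
The abstract does not describe the internals of the fusion module or of DFFS, so the sketch below is only an illustration of the general idea under assumptions: the student's feature map is channel-aligned to the teacher's with a 1x1 convolution, a learnable gate (a stand-in for DFFS) mixes the two maps, and the student is then pulled toward the fused target with an L2 loss. All module names, shapes, and loss weights here are hypothetical.

```python
# Minimal sketch of cross-layer feature fusion for distillation.
# NOT the paper's implementation: the gate, alignment, and loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionModule(nn.Module):
    """Fuse a teacher feature map with a student feature map.

    A 1x1 conv aligns the student channels to the teacher's, and a learnable
    gate decides, per location, how much of each source enters the fused
    target (an assumed stand-in for the paper's DFFS).
    """

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        self.align = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
        self.gate = nn.Sequential(
            nn.Conv2d(2 * teacher_channels, teacher_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        f_student = self.align(f_student)
        if f_student.shape[-2:] != f_teacher.shape[-2:]:
            # Match spatial size when the teacher and student layers disagree.
            f_student = F.interpolate(
                f_student, size=f_teacher.shape[-2:], mode="bilinear", align_corners=False
            )
        gate = self.gate(torch.cat([f_student, f_teacher], dim=1))
        # Fused target: a gated mix of teacher and (aligned) student features.
        return gate * f_teacher + (1.0 - gate) * f_student


def feature_fusion_loss(
    fusion: FusionModule, f_student: torch.Tensor, f_teacher: torch.Tensor
) -> torch.Tensor:
    """Pull the student's features toward the fused features.

    The fused target is detached so the student chases a fixed target each
    step; how (or whether) the fusion module itself is further supervised is
    not specified in the abstract and is omitted here.
    """
    fused = fusion(f_student, f_teacher).detach()
    aligned = fusion.align(f_student)
    if aligned.shape[-2:] != fused.shape[-2:]:
        aligned = F.interpolate(
            aligned, size=fused.shape[-2:], mode="bilinear", align_corners=False
        )
    return F.mse_loss(aligned, fused)


# Example with dummy feature maps (shapes are illustrative only).
fusion = FusionModule(student_channels=64, teacher_channels=256)
f_s = torch.randn(8, 64, 16, 16)   # student feature map
f_t = torch.randn(8, 256, 16, 16)  # teacher feature map (teacher is frozen)
loss = feature_fusion_loss(fusion, f_s, f_t)
```

In practice this feature-level term would be added to the usual task loss (and possibly a logit-distillation loss) with some weighting; the abstract does not state the exact training objective.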
