Abstract

Knowledge distillation (KD) has been widely used across many tasks as a practical model compression technique. Because directly applying Mean Squared Error (MSE) to the intermediate features of the teacher and student performs poorly, most feature-based detector distillation methods focus on designing diverse attention mechanisms and then use MSE to guide the student toward the critical information, while the role of MSE itself in detector distillation is often overlooked. To improve the distillation performance of MSE and offer a new perspective, we propose channel-level Matching Knowledge Distillation via MSE (MKD). The idea of MKD is simple but effective: based on whether student and teacher channel features in the Feature Pyramid Network (FPN) match, we divide the channels at the same location in the FPN layers of the student and teacher into matched and mismatched channels and assign them different weights when computing the MSE-based distillation loss. This forces the student to distinguish the teacher's channels and imitate the features of the corresponding channels, yielding better performance. Extensive experiments on the MS COCO and PASCAL VOC benchmarks show that MKD outperforms state-of-the-art distillation methods. For example, with a powerful RetinaNet-ResNeXt101 detector as the teacher, ResNet-50 based RetinaNet and RepPoints trained with our distillation method achieve 40.9% and 42.2% mAP on COCO2017, which are 3.5% and 3.6% higher than their baselines, respectively.
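The abstract does not specify how channels are judged to match or mismatch, so the sketch below is only an illustration of the general idea of a channel-level weighted MSE distillation loss over FPN features. The cosine-similarity threshold `tau`, the weights `w_match` and `w_mismatch`, and the function name `mkd_loss` are assumptions for this sketch, not the paper's actual matching rule or hyperparameters.

```python
# Minimal sketch of a channel-level weighted MSE distillation loss.
# The matching criterion (cosine similarity with threshold tau) and the
# weight values are illustrative assumptions; the paper's MKD rule may differ.
import torch
import torch.nn.functional as F


def mkd_loss(feat_s, feat_t, tau=0.5, w_match=1.0, w_mismatch=2.0):
    """feat_s, feat_t: student/teacher FPN features of shape (N, C, H, W)."""
    n, c, h, w = feat_s.shape
    s = feat_s.reshape(n, c, -1)
    t = feat_t.reshape(n, c, -1)

    # Per-channel similarity between student and teacher features
    # (assumed matching criterion, not taken from the paper).
    sim = F.cosine_similarity(s, t, dim=-1)           # (N, C)
    match = (sim >= tau).float()                      # 1 = matched, 0 = mismatched

    # Channel-wise MSE between student and teacher features.
    mse = ((s - t) ** 2).mean(dim=-1)                 # (N, C)

    # Different weights for matched and mismatched channels.
    weights = w_match * match + w_mismatch * (1.0 - match)
    return (weights * mse).mean()


# Usage: distill one FPN level of a RetinaNet-style student
# (FPN channel counts of teacher and student are equal, e.g. 256).
# loss_kd = mkd_loss(student_fpn_p3, teacher_fpn_p3)
```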
