Abstract

Knowledge distillation (KD) trains a lightweight proxy network, termed the student, to mimic the outputs of a heavy neural network, termed the teacher, so that the student can run in real time on resource-limited devices. This paradigm requires aligning the softened logits of the teacher and the student. However, few works question whether softening the logits truly brings out the full potential of the teacher-student paradigm. In this paper, we conduct several analyses to investigate this issue from scratch. Subsequently, we devise several simple yet effective functions to replace the vanilla KD loss. The final function serves as an effective alternative to its original counterpart and works well with other techniques such as FitNets. To support this claim, we evaluate several visual tasks on individual benchmarks, and the experimental results verify the potential of the proposed function in terms of performance gains. For example, when the teacher and student networks are ShuffleNetV2-1.0 and ShuffleNetV2-0.5, our proposed method achieves a 40.88% top-1 error rate on Tiny ImageNet.
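For context, the vanilla KD objective that the abstract refers to, not the paper's proposed replacement, aligns temperature-softened logits of teacher and student with a KL-divergence term on top of the usual cross-entropy. The sketch below assumes a PyTorch setting; the temperature T and weight alpha are illustrative hyperparameters, not values from the paper.

```python
# Minimal sketch of the vanilla KD loss (Hinton-style soft-logit alignment).
# This is the baseline the paper questions, not its proposed function.
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soften both sets of logits with temperature T and compare via KL divergence.
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    kd_term = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)

    # Standard cross-entropy against the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)

    # Weighted combination of the soft-target and hard-target terms.
    return alpha * kd_term + (1.0 - alpha) * ce_term
```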
