Abstract
Knowledge distillation is one of the most compelling approaches to model compression: it transfers representational expertise from a large deep-learning teacher model to a small student network. Although numerous techniques have been proposed to improve teacher representations at the logit level, no study has examined the weaknesses of the teacher's representations at the feature level during distillation. At the same time, in a trained deep-learning model, not all kernels are uniformly activated when making a specific prediction. Transferring this knowledge directly may cause the student to learn a suboptimal intrinsic distribution and may prevent existing distillation methods from reaching their full potential. Motivated by these issues, this study analyses the generalization capability of teachers with and without a uniformly activated channel distribution. Preliminary investigations and theoretical analyses show that partly uniforming, or smoothing, the feature maps yields improved representations that enrich generalization capability. Based on these observations, it is hypothesized that distillation-based explicit supervision using smoothed feature maps together with a cross-entropy loss plays a significant role in improving generalization. Hence, this paper proposes a novel technique called Partly Unified Recalibrated Feature (PURF) map distillation. The proposed method recalibrates the feature maps by exchanging representational cues among nearest-neighbor channels. PURF improves the performance of state-of-the-art knowledge distillation (KD) methods across architectures in terms of generalization, model compression, few-shot training, transferability, and robustness transfer on standard benchmark datasets. PURF achieves a 1.51% average accuracy improvement on seven diverse architectures in image classification, and it raises the accuracy of state-of-the-art KD methods by an average of 1.91% across architectures. Moreover, PURF achieves 2.02% and 0.96% higher accuracy on average in transferability and robustness tasks, respectively, on standard benchmark datasets.
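The abstract does not specify how the partly smoothed feature maps are computed or combined with the cross-entropy loss. The following is a minimal, hypothetical PyTorch sketch of one plausible reading: "intercommunicating representational cues among nearest-neighbor channels" is interpreted here as a local average along the channel axis, partially blended with the original teacher features, and used as an additional feature-distillation target alongside the student's cross-entropy loss. The function names, the blending factor alpha, the loss weight beta, and the assumption that student and teacher feature maps share the same shape are all illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def smooth_channels(feat, kernel_size=3, alpha=0.5):
    """Partly smooth a feature map by blending each channel with the mean
    of its nearest-neighbor channels (hypothetical reading of PURF).

    feat: teacher feature map of shape (B, C, H, W).
    alpha: fraction of the channel-neighborhood average mixed into each channel.
    """
    b, c, h, w = feat.shape
    # Treat the channel dimension as a 1-D sequence per spatial location.
    x = feat.permute(0, 2, 3, 1).reshape(b * h * w, 1, c)          # (B*H*W, 1, C)
    smoothed = F.avg_pool1d(x, kernel_size, stride=1,
                            padding=kernel_size // 2,
                            count_include_pad=False)
    smoothed = smoothed.reshape(b, h, w, c).permute(0, 3, 1, 2)    # back to (B, C, H, W)
    # "Partly" smoothing: keep (1 - alpha) of the original response.
    return (1 - alpha) * feat + alpha * smoothed


def distillation_loss(student_feat, teacher_feat, student_logits, labels, beta=1.0):
    """Cross-entropy on the student's predictions plus an MSE term that
    supervises the student with the partly smoothed teacher feature map.
    Assumes student_feat and teacher_feat have matching shapes."""
    ce = F.cross_entropy(student_logits, labels)
    feat_term = F.mse_loss(student_feat, smooth_channels(teacher_feat).detach())
    return ce + beta * feat_term
```

In this sketch the smoothed teacher map is detached so gradients flow only into the student, mirroring the standard teacher-student setup; a projection layer would be needed if the two networks' channel counts differ.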