Abstract

Knowledge distillation (KD) transfers the knowledge learned by a teacher network with a complex architecture and strong learning ability to a lightweight student network with weaker learning ability through a specific distillation strategy. However, most existing KD approaches to image classification employ a single teacher network to guide the training of the student network; when the teacher makes an erroneous prediction, the transferred knowledge degrades the student's performance. To mitigate this issue, we develop a novel KD approach, Coordinate Attention Guided Dual-Teacher Adaptive Knowledge Distillation (CAG-DAKD), which delivers more discriminative and comprehensive knowledge from two teacher networks to a compact student network. Specifically, we integrate the positive prediction distributions of the two teacher networks according to whether each teacher predicts correctly and the magnitude of its cross-entropy, so that a better output distribution guides the student network. Furthermore, to distill the most valuable knowledge from the first teacher network, whose architecture is similar to that of the student, a coordinate attention mechanism is introduced into different layers of the first teacher so that the student can learn more discriminative feature representations. We conduct extensive experiments on three standard image classification datasets, CIFAR10, CIFAR100, and ImageNet, to verify the superiority of the proposed method over other state-of-the-art competitors. Code will be available at https://github.com/mdt1219/CAG-DAKD.git/
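To make the dual-teacher fusion concrete, the following is a minimal PyTorch sketch of one plausible reading of the rule stated above: per sample, prefer the teacher that predicts correctly, and when both teachers are correct keep the one with the smaller cross-entropy. The function name `fuse_teacher_logits`, the fallback when both teachers are wrong, and the per-sample selection (rather than a soft weighting) are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def fuse_teacher_logits(logits_t1, logits_t2, targets):
    """Hypothetical fusion of two teachers' output distributions.

    Prefers the teacher that classifies each sample correctly; if both are
    correct, keeps the one with the lower cross-entropy. If neither is
    correct, it falls back to teacher 2 (an assumption, not from the paper).
    """
    # Per-sample cross-entropy of each teacher against the ground truth.
    ce1 = F.cross_entropy(logits_t1, targets, reduction="none")
    ce2 = F.cross_entropy(logits_t2, targets, reduction="none")

    # Per-sample correctness of each teacher's top-1 prediction.
    correct1 = logits_t1.argmax(dim=1).eq(targets)
    correct2 = logits_t2.argmax(dim=1).eq(targets)

    # Pick teacher 1 when it alone is correct, or when both are correct
    # and it is the more confident (lower cross-entropy) of the two.
    both_correct = correct1 & correct2
    pick_t1 = (correct1 & ~correct2) | (both_correct & (ce1 <= ce2))

    # Select logits sample-by-sample from the preferred teacher.
    return torch.where(pick_t1.unsqueeze(1), logits_t1, logits_t2)
```

In a standard logit-distillation setup, the fused logits would then serve as the soft target in a temperature-scaled KL-divergence loss against the student's output.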
