Abstract

Knowledge distillation (KD) transfers knowledge from a heavy teacher network to a lightweight student network so that the student's performance remains close to that of the teacher. However, a large capacity gap between the teacher and the student hinders KD, so a large teacher network is not necessarily the most suitable guide for the student. This study therefore proposes a KD method guided by multiple homogeneous teachers. First, unlike traditional KD, which relies on a single large teacher network, multiple networks with the same structure as the student are pretrained to act as a teacher group, alleviating the capacity gap between teacher and student. Second, a confidence-adaptive initialization strategy is developed to initialize the student network, which then learns from the pretrained teacher group. Experiments are performed on CIFAR10, CIFAR100, and Tiny-ImageNet using three network architectures, and the results demonstrate that the proposed method outperforms existing advanced KD methods. Furthermore, a similarity loss function is introduced to optimize the parameters of the classifier in the student network. The results indicate that this loss improves performance on basic classification tasks without KD and also works effectively within the proposed KD method.
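
The multi-teacher setup can be illustrated with a minimal sketch: the homogeneous teachers' softened predictions are aggregated and distilled into the student alongside the usual supervised loss. The averaging rule, temperature, and loss weighting below are illustrative assumptions, not the paper's exact formulation (which also includes the confidence-adaptive initialization and similarity loss not shown here).

```python
# Minimal sketch of KD from a group of homogeneous teachers (assumptions:
# soft targets are aggregated by simple averaging; temperature and alpha
# are illustrative hyperparameters, not values from the paper).
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          temperature=4.0, alpha=0.5):
    """Cross-entropy on hard labels plus KL divergence to the averaged
    soft targets of a group of same-architecture teachers."""
    # Average the teachers' softened class distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=1) for t in teacher_logits_list]
    ).mean(dim=0)

    # Standard supervised loss on the ground-truth labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    # Distillation loss between the student's softened predictions and the
    # aggregated teacher distribution (scaled by T^2, as in standard KD).
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        teacher_probs,
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * ce_loss + (1.0 - alpha) * kd_loss
```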
