Abstract

Research on knowledge distillation for deep neural networks has become increasingly active. Knowledge distillation involves training a low-capacity student model under the guidance of a high-capacity teacher model. However, when the capacities of the teacher and student models differ substantially, the result can be poor learning and low generalization performance. We propose a novel teacher assistant model called Knowledge in Attention Assistant. This model learns a discriminative representation of important regions and statistical information, along with spatial and channel knowledge. Moreover, by using a triplet attention mechanism, the student model can learn both the inner and outer distributions of the different categories and also memorize the knowledge distribution of the teacher model. This alignment improves the effectiveness and generalization of knowledge distillation and reduces the capacity gap between the teacher and student models. The proposed model addresses feature inconsistency by adjusting the attention weight distribution according to the resemblance between the teacher and student features. Evaluation of the proposed teacher assistant method shows remarkable results: the student model outperforms the teacher model in generalization performance, reaching 93.37% and 94.09% on the CIFAR-10 and CIFAR-100 datasets, respectively. Furthermore, the proposed model achieves F1-scores of 91.98% on CIFAR-10 and 79.69% on CIFAR-100.
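
To make the distillation objective concrete, the sketch below (in PyTorch) illustrates the general idea rather than the authors' exact method: soft-label distillation on the logits combined with matching of spatial attention maps, where the attention term is re-weighted by how far apart the teacher and student maps are. The function names, feature shapes, temperature, and weighting scheme are illustrative assumptions, not details taken from the paper.

    # Minimal sketch (not the authors' implementation) of attention-based
    # knowledge distillation with similarity-dependent weighting.
    import torch
    import torch.nn.functional as F

    def spatial_attention_map(feat: torch.Tensor) -> torch.Tensor:
        """Collapse a (B, C, H, W) feature map into an L2-normalized (B, H*W) spatial attention map."""
        attn = feat.pow(2).mean(dim=1)        # channel-wise energy -> (B, H, W)
        attn = attn.flatten(start_dim=1)      # (B, H*W)
        return F.normalize(attn, p=2, dim=1)

    def distillation_loss(student_logits, teacher_logits,
                          student_feats, teacher_feats,
                          temperature: float = 4.0, beta: float = 1000.0):
        """KL-based logit distillation plus attention-map matching.

        The attention term is re-weighted by the cosine distance between the
        teacher and student maps, loosely mirroring the idea of adjusting
        attention weights by teacher/student feature resemblance (an assumption here).
        """
        # Soft-label KL divergence between teacher and student predictions.
        kd = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=1),
            F.softmax(teacher_logits / temperature, dim=1),
            reduction="batchmean",
        ) * temperature ** 2

        # Attention transfer over each pair of intermediate feature maps.
        at = torch.tensor(0.0, device=student_logits.device)
        for fs, ft in zip(student_feats, teacher_feats):
            a_s, a_t = spatial_attention_map(fs), spatial_attention_map(ft)
            weight = 1.0 - F.cosine_similarity(a_s, a_t, dim=1).mean()  # larger gap -> larger weight
            at = at + weight * (a_s - a_t).pow(2).mean()

        return kd + beta * at

In a training loop, student_feats and teacher_feats would be lists of intermediate feature maps of matching spatial size, and the returned loss would be added to the usual cross-entropy term on the ground-truth labels.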
