Abstract

In feature-based knowledge distillation (KD), when the spatial shape of the teacher's feature maps is significantly larger than that of the student's, two problems arise: first, the feature maps cannot be compared directly; second, the knowledge contained in these complex feature maps is not readily apprehensible to the student. This paper proposes a new KD method in which Tucker decomposition is applied to the teacher's high-dimensional feature maps to obtain their core tensors. Owing to their low complexity, the knowledge in these core tensors can be easily absorbed by the student. Furthermore, an adaptor function is proposed that balances the spatial shapes of the teacher's and student's core tensors and enables their comparison through a convolution regressor. Finally, a hybrid loss based on the adaptor function is proposed to distill the knowledge of the teacher's core tensors to the student. Both the teacher and student models were deployed on smartphones serving as edge devices, and the experiments were evaluated in terms of recognition rate and complexity. According to the results, the student model, built on the ResNet-18 architecture, has ∼65.44 million fewer parameters, ∼6.45 fewer GFLOPs of computational complexity, ∼1.12 GB less GPU memory usage, and a ∼265.67 times higher compression rate than its teacher built on the ResNet-50 architecture, while the recognition rate of the student model drops by merely 1.5% on the benchmark dataset.
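The following is a minimal, illustrative sketch of the pipeline described above, not the authors' implementation. It assumes PyTorch and TensorLy; the names (core_tensor, Adaptor, rank_s, rank_t, alpha), the 1x1-convolution form of the regressor, and the loss weighting are all assumptions introduced for illustration.

```python
# Illustrative sketch only (assumed PyTorch + TensorLy; not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F
import tensorly as tl
from tensorly.decomposition import tucker

tl.set_backend("pytorch")


def core_tensor(feat, rank):
    """Tucker-decompose a single (C, H, W) feature map and keep only its core tensor."""
    core, _factors = tucker(feat, rank=rank)
    return core


class Adaptor(nn.Module):
    """Assumed form of the adaptor function: a 1x1 convolution regressor that maps
    the student core tensor to the teacher core tensor's channel count, followed by
    resizing to the teacher core's spatial shape so the two can be compared."""

    def __init__(self, student_core_channels, teacher_core_channels):
        super().__init__()
        self.regressor = nn.Conv2d(student_core_channels, teacher_core_channels, kernel_size=1)

    def forward(self, core_s, target_hw):
        core_s = self.regressor(core_s)
        return F.interpolate(core_s, size=target_hw, mode="bilinear", align_corners=False)


def hybrid_kd_loss(student_feats, teacher_feats, adaptor, rank_s, rank_t,
                   student_logits, labels, alpha=0.5):
    """Hybrid loss: task cross-entropy plus an MSE term between the teacher's core
    tensors and the adapted student core tensors (the weighting alpha is an assumption)."""
    # Decompose each sample's feature map; the teacher path carries no gradient.
    with torch.no_grad():
        core_t = torch.stack([core_tensor(f, rank_t) for f in teacher_feats])
    core_s = torch.stack([core_tensor(f, rank_s) for f in student_feats])
    core_s = adaptor(core_s, target_hw=core_t.shape[-2:])
    return F.cross_entropy(student_logits, labels) + alpha * F.mse_loss(core_s, core_t)
```

In this sketch, rank_s[0] must match the adaptor's input channel count and rank_t[0] its output channel count; the actual ranks, adaptor architecture, and loss terms used in the paper may differ.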
