Abstract

Large-scale deep learning models have achieved impressive results on a variety of tasks; however, deploying them on edge or mobile devices remains challenging due to the limited memory and computational capability of such hardware. Knowledge distillation is an effective model compression technique that can boost the performance of a lightweight student network by transferring knowledge from a more complex model or an ensemble of models. Owing to its reduced size, the resulting lightweight model is better suited for deployment on edge devices. In this paper, we introduce an online knowledge distillation framework that relies on an original attention mechanism to effectively combine the predictions of a cohort of lightweight (student) networks into a powerful ensemble, which is then used as the distillation signal. The proposed aggregation strategy uses the predictions of the individual students, together with the ground truth labels, to determine the weights for ensembling these predictions. This mechanism is used only during training; at test or inference time, a single lightweight student is extracted and deployed. Extensive experiments on several image classification benchmarks, both training models from scratch (on CIFAR-10, CIFAR-100, and Tiny ImageNet) and using transfer learning (on Oxford Pets and Oxford Flowers), show that the proposed framework consistently improves the accuracy of the knowledge-distilled students, demonstrating the effectiveness of the proposed solution. Moreover, in the case of the ResNet architecture, the knowledge-distilled model achieves higher accuracy than a deeper, individually trained ResNet model.
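To make the aggregation idea concrete, the sketch below shows one way an attention-style weighting of student predictions could be implemented in PyTorch. The `AttentionEnsemble` module, its gating/scoring rule, and the loss hyperparameters (`T`, `alpha`) are illustrative assumptions, not the paper's exact formulation; it is a minimal sketch of weighting student logits using both the predictions and the ground truth, then distilling each student from the resulting ensemble.

```python
# Minimal sketch: attention-weighted ensembling of a student cohort.
# Module name, scoring rule, and hyperparameters are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionEnsemble(nn.Module):
    """Combine a cohort of student logits into one ensemble (teacher) signal."""
    def __init__(self, num_classes):
        super().__init__()
        # Hypothetical gating layer that scores each student's prediction;
        # the actual framework may compute attention scores differently.
        self.gate = nn.Linear(num_classes, 1)

    def forward(self, student_logits, targets):
        # student_logits: list of [batch, num_classes] tensors, one per student.
        stacked = torch.stack(student_logits, dim=1)          # [B, S, C]
        scores = self.gate(stacked).squeeze(-1)               # [B, S]
        # One illustrative way to inject label information: favor students
        # that already assign high probability to the ground-truth class.
        probs = F.softmax(stacked, dim=-1)                    # [B, S, C]
        idx = targets.view(-1, 1, 1).expand(-1, stacked.size(1), 1)
        correct = probs.gather(2, idx).squeeze(-1)            # [B, S]
        weights = F.softmax(scores + correct, dim=1)          # [B, S]
        # Weighted combination of the student logits -> ensemble prediction.
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)   # [B, C]

def distillation_loss(student_logits, ensemble_logits, targets, T=3.0, alpha=0.5):
    """Cross-entropy on labels plus KL distillation from the ensemble."""
    ce = F.cross_entropy(student_logits, targets)
    # Treat the ensemble as a fixed target for this loss term (detach).
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(ensemble_logits.detach() / T, dim=1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1.0 - alpha) * kd
```

In such a setup, each student in the cohort would receive both the standard supervised loss and the distillation term computed against the shared ensemble; at inference, the ensemble module is discarded and a single student is kept.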
