Abstract

Large neural networks are impractical to deploy on mobile devices due to their heavy computational cost and slow inference. Knowledge distillation (KD) is a technique that reduces model size while retaining performance by transferring knowledge from a large “teacher” model to a smaller “student” model. However, KD on multimodal datasets such as vision-language datasets is relatively unexplored, and digesting such multimodal information is challenging since different modalities present different types of information. In this paper, we propose modality-specific distillation (MSD) to effectively transfer knowledge from a teacher on multimodal datasets. Existing KD approaches can be applied to the multimodal setup, but the student does not have access to the teacher’s modality-specific predictions. Our approach mimics the teacher’s modality-specific predictions by introducing an auxiliary loss term for each modality. Because each modality has different importance for predictions, we also propose weighting approaches for the auxiliary losses, including a meta-learning approach that learns the optimal weights on these loss terms. In our experiments, we demonstrate the effectiveness of MSD and the weighting schemes and show that MSD achieves better performance than KD.
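As a rough illustration of the idea, the sketch below combines the standard distillation loss with one auxiliary loss per modality, each matching the teacher’s prediction when it sees only that modality. It is a minimal PyTorch sketch under several assumptions not stated in this summary: the hypothetical `student` and `teacher` models accept a missing modality as `None`, and `lambda_img` / `lambda_txt` stand in for the modality weights, which the paper obtains either population-wide or per instance via meta-learning.

```python
import torch
import torch.nn.functional as F

def soft_ce(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

def msd_loss(student, teacher, image, text, labels,
             lambda_img=0.5, lambda_txt=0.5, alpha=0.5, T=2.0):
    """Modality-specific distillation sketch (not the paper's exact formulation).

    The student matches (1) the teacher's multimodal prediction and
    (2) the teacher's modality-specific predictions, obtained here by
    feeding the teacher one modality at a time (the other passed as None).
    lambda_img / lambda_txt weight the auxiliary, per-modality losses.
    """
    s_logits = student(image, text)
    with torch.no_grad():
        t_logits = teacher(image, text)   # full multimodal prediction
        t_img = teacher(image, None)      # image-only prediction (assumed interface)
        t_txt = teacher(None, text)       # text-only prediction (assumed interface)

    task = F.cross_entropy(s_logits, labels)   # supervised loss on ground-truth labels
    kd = soft_ce(s_logits, t_logits, T)        # standard distillation loss
    aux = lambda_img * soft_ce(s_logits, t_img, T) \
        + lambda_txt * soft_ce(s_logits, t_txt, T)
    return (1 - alpha) * task + alpha * (kd + aux)
```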

Highlights

  • Recent advances in computer vision and natural language processing are attributed to deep neural networks with a large number of layers

  • We demonstrate the effectiveness of our modality-specific distillation (MSD) and its weighting scheme, showing that MSD achieves better performance than knowledge distillation (KD)

  • KD has been explored in various studies such as improving a student model (Hinton et al, 2015; Park et al, 2019; Romero et al, 2014; Tian et al, 2019; Muller et al, 2020) and improving a teacher model itself by self-distillation (Xie et al, 2020; Kim et al, 2020; Furlanello et al, 2018)

Summary

Introduction

Recent advances in computer vision and natural language processing are attributed to deep neural networks with a large number of layers. Current state-of-the-art architectures are getting wider and deeper, with billions of parameters, e.g., BERT (Devlin et al, 2019) and GPT-3 (Brown et al, 2020). In addition to their huge sizes, such wide and deep models suffer from high computational costs and latencies at inference. These shortcomings greatly limit their deployment on resource-constrained devices such as mobile phones. Knowledge distillation (KD) (Hinton et al, 2015) treats the knowledge in the teacher as a learned mapping from inputs to outputs, and transfers this knowledge by training the student model with the teacher’s outputs (of the last or a hidden layer) as targets. We show that datasets differ in their need for population-based or sample-specific weighting; the MM-IMDB dataset, for example, shows less improvement from instance-wise weighting compared to population-based weighting.
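To make the weighting distinction concrete, here is a minimal PyTorch sketch: population-based weighting uses one scalar weight per modality shared by all examples, while instance-wise weighting predicts a weight per example and per modality from its features. The small module below is a hypothetical stand-in for the meta-learned weighting described in the paper, not its actual implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class InstanceWiseWeights(nn.Module):
    """Predicts one weight per (example, modality) from fused features.

    Population-based weighting is the degenerate case where lambda_img and
    lambda_txt are single scalars shared by every training example; here the
    weights vary per instance. In the paper these weights are tuned by a
    meta-learning procedure; training this module jointly with the student,
    as implied here, is only a rough stand-in for that.
    """

    def __init__(self, feat_dim, n_modalities=2):
        super().__init__()
        self.net = nn.Linear(feat_dim, n_modalities)

    def forward(self, fused_features):
        # Softmax keeps the per-example weights positive and comparable
        # across modalities: weights[:, 0] would scale the image auxiliary
        # loss and weights[:, 1] the text auxiliary loss for each example.
        return F.softmax(self.net(fused_features), dim=-1)
```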

Background
Limitations
Method
Analysis
Related Work
Findings
A Case Study
