Abstract

Large neural networks are impractical to deploy on mobile devices due to their heavy computational cost and slow inference. Knowledge distillation (KD) is a technique that reduces model size while retaining performance by transferring knowledge from a large “teacher” model to a smaller “student” model. However, KD on multimodal datasets such as vision-language datasets is relatively unexplored, and digesting such multimodal information is challenging since different modalities present different types of information. In this paper, we propose modality-specific distillation (MSD) to effectively transfer knowledge from a teacher on multimodal datasets. Existing KD approaches can be applied to the multimodal setup, but the student does not have access to the teacher’s modality-specific predictions. Our approach mimics the teacher’s modality-specific predictions by introducing an auxiliary loss term for each modality. Because each modality has different importance for predictions, we also propose weighting approaches for the auxiliary losses, including a meta-learning approach that learns the optimal weights on these loss terms. In our experiments, we demonstrate the effectiveness of MSD and the weighting schemes and show that MSD achieves better performance than KD.
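As a rough illustration of the idea, the sketch below combines the standard distillation loss with one auxiliary loss per modality, each matching the teacher’s prediction when it sees only that modality. It is a minimal PyTorch sketch under several assumptions not stated in this summary: the hypothetical `student` and `teacher` models accept a missing modality as `None`, and `lambda_img` / `lambda_txt` stand in for the modality weights, which the paper obtains either population-wide or per instance via meta-learning.

```python
import torch
import torch.nn.functional as F

def soft_ce(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

def msd_loss(student, teacher, image, text, labels,
             lambda_img=0.5, lambda_txt=0.5, alpha=0.5, T=2.0):
    """Modality-specific distillation sketch (not the paper's exact formulation).

    The student matches (1) the teacher's multimodal prediction and
    (2) the teacher's modality-specific predictions, obtained here by
    feeding the teacher one modality at a time (the other passed as None).
    lambda_img / lambda_txt weight the auxiliary, per-modality losses.
    """
    s_logits = student(image, text)
    with torch.no_grad():
        t_logits = teacher(image, text)   # full multimodal prediction
        t_img = teacher(image, None)      # image-only prediction (assumed interface)
        t_txt = teacher(None, text)       # text-only prediction (assumed interface)

    task = F.cross_entropy(s_logits, labels)   # supervised loss on ground-truth labels
    kd = soft_ce(s_logits, t_logits, T)        # standard distillation loss
    aux = lambda_img * soft_ce(s_logits, t_img, T) \
        + lambda_txt * soft_ce(s_logits, t_txt, T)
    return (1 - alpha) * task + alpha * (kd + aux)
```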

Highlights

  • Recent advances in computer vision and natural language processing are attributed to deep neural networks with a large number of layers

  • We demonstrate the effectiveness of our modality-specific distillation (MSD) and its weighting scheme, showing that MSD achieves better performance than knowledge distillation (KD)

  • KD has been explored in various studies such as improving a student model (Hinton et al, 2015; Park et al, 2019; Romero et al, 2014; Tian et al, 2019; Muller et al, 2020) and improving a teacher model itself by self-distillation (Xie et al, 2020; Kim et al, 2020; Furlanello et al, 2018)

Summary

Introduction

Recent advances in computer vision and natural language processing are attributed to deep neural networks with a large number of layers. Current state-of-the-art architectures are getting wider and deeper, with billions of parameters, e.g., BERT (Devlin et al, 2019) and GPT-3 (Brown et al, 2020). In addition to their huge sizes, such wide and deep models suffer from high computational costs and latencies at inference. These shortcomings greatly limit their deployment on resource-constrained devices such as mobile phones. Knowledge distillation (KD) (Hinton et al, 2015) treats the knowledge in the teacher as a learned mapping from inputs to outputs, and transfers this knowledge by training the student model with the teacher’s outputs (of the last or a hidden layer) as targets. We show that datasets differ in their need for population-based or sample-specific weighting; the MM-IMDB dataset, for example, shows less improvement from instance-wise weighting compared to population-based weighting.
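To make the weighting distinction concrete, here is a minimal PyTorch sketch: population-based weighting uses one scalar weight per modality shared by all examples, while instance-wise weighting predicts a weight per example and per modality from its features. The small module below is a hypothetical stand-in for the meta-learned weighting described in the paper, not its actual implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class InstanceWiseWeights(nn.Module):
    """Predicts one weight per (example, modality) from fused features.

    Population-based weighting is the degenerate case where lambda_img and
    lambda_txt are single scalars shared by every training example; here the
    weights vary per instance. In the paper these weights are tuned by a
    meta-learning procedure; training this module jointly with the student,
    as implied here, is only a rough stand-in for that.
    """

    def __init__(self, feat_dim, n_modalities=2):
        super().__init__()
        self.net = nn.Linear(feat_dim, n_modalities)

    def forward(self, fused_features):
        # Softmax keeps the per-example weights positive and comparable
        # across modalities: weights[:, 0] would scale the image auxiliary
        # loss and weights[:, 1] the text auxiliary loss for each example.
        return F.softmax(self.net(fused_features), dim=-1)
```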

Background
Limitations
Method
Analysis
Related Work
Findings
A Case Study
