Abstract

Recent years have witnessed the rapid development of Vision and Language (VL) learning with deep neural architectures, benefiting from large amounts of unlabeled or weakly labeled, heterogeneous data. A notable challenge arises when deploying these cross-modal models on edge devices, which usually have limited computational power. Under a constrained training/inference budget, it is impractical for real-world applications to exploit the full power of prevailing models, especially when dealing with abundant multi-modal data that requires significant resources. As an important technique for deep model compression, knowledge distillation (KD) has been widely applied to various tasks with the aim of building a small yet powerful student model from a large teacher model. To obtain compact vision-language models that are practical in the real world, it is essential to exploit knowledge distillation techniques to improve VL learning on compact models. Combining KD with VL is of great practical value but has been less explored in the previous literature. This motivates the study of KD across different modalities (vision, language) and learning paradigms (self-supervised learning, contrastive learning); KD also serves a vital role in many sub-disciplines of cross-modal tasks, such as image/video captioning, visual question answering, and image/video retrieval. This chapter summarizes and discusses in depth this emerging area of research. We first comprehensively review existing KD algorithms, focusing on their applications to Vision-Language tasks as well as recent works that leverage the self-distillation technique for training joint VL representations.
The chapter also features recent works at the forefront of research on this topic that leverage KD for Vision-Language representation learning, as well as its application to downstream VL tasks such as captioning and visual question answering. The chapter further discusses and studies variations of KD applied to Vision- and Language-related topics.
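To make the teacher–student setup described above concrete, the sketch below computes the standard soft-label distillation objective: a temperature-softened KL-divergence term between teacher and student logits, combined with the usual cross-entropy on ground-truth labels. This is a minimal illustration, not code from the chapter; the function names and the values of `T` and `alpha` are hypothetical, and NumPy is assumed.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T yields a softer distribution.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.7):
    """Soft-label distillation loss (illustrative sketch; T and alpha
    are hypothetical hyperparameters, not values from the chapter)."""
    p_t = softmax(teacher_logits, T)  # soft targets from the teacher
    p_s = softmax(student_logits, T)  # soft predictions from the student
    # KL(p_t || p_s), scaled by T^2 to keep gradient magnitudes comparable
    kl = float(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))) * T * T
    # Standard cross-entropy on the hard ground-truth label
    ce = -float(np.log(softmax(student_logits)[label] + 1e-12))
    return alpha * kl + (1.0 - alpha) * ce

# A student whose logits mimic the teacher's incurs a smaller loss.
teacher = [3.0, 1.0, 0.2]
assert kd_loss([2.8, 1.1, 0.3], teacher, label=0) < kd_loss([0.1, 0.2, 3.0], teacher, label=0)
```

The interpolation weight `alpha` trades off matching the teacher's soft distribution against fitting the hard labels, which is what lets a compact student inherit the "dark knowledge" encoded in the teacher's relative class probabilities.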
