Abstract

Transformer-based models can learn representations for images and text simultaneously, providing excellent performance for multimodal applications. In practice, however, their large number of parameters can hinder deployment on resource-constrained devices, creating a need for model compression. To this end, recent studies use knowledge distillation to transfer knowledge from a large trained teacher model to a small student model with little performance sacrifice. However, these approaches typically train the student's parameters using only the last layer of the teacher, which makes the student prone to overfitting during distillation. Furthermore, mutual interference between modalities adds further difficulty to distillation. To address these issues, this study proposes layerwise multimodal knowledge distillation for vision-language pretrained models. In addition to the last layer, the intermediate layers of the teacher are also used for knowledge transfer. To avoid interference between modalities, we split the multimodal input into its separate modalities and add them as extra inputs. Two auxiliary losses are then introduced to encourage each modality to distill more effectively. Comparative experiments on four multimodal tasks show that the proposed layerwise multimodal distillation outperforms other knowledge distillation methods for vision-language pretrained models.
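As a rough illustration of how such an objective might be assembled, the following PyTorch sketch combines last-layer response distillation, intermediate-layer hidden-state matching, and auxiliary per-modality consistency terms. All names (layerwise_multimodal_kd_loss, layer_map, alpha_aux, etc.) and the exact form of the auxiliary losses are assumptions for illustration, not the paper's implementation; teacher and student hidden sizes are assumed equal (otherwise a learned projection would be needed).

```python
import torch
import torch.nn.functional as F

def layerwise_multimodal_kd_loss(
    teacher_hidden,          # list of teacher hidden states for the selected layers
    student_hidden,          # list of student hidden states, one per student layer
    teacher_logits,          # teacher logits on the fused (image + text) input
    student_logits,          # student logits on the fused (image + text) input
    student_vision_logits,   # student logits on the vision-only extra input
    student_text_logits,     # student logits on the text-only extra input
    layer_map,               # layer_map[s] = index of the teacher layer matched to student layer s
    temperature=2.0,         # softening temperature for response distillation
    alpha_layer=1.0,         # weight of intermediate-layer matching
    alpha_aux=0.5,           # weight of the auxiliary single-modality losses
):
    """Hypothetical layerwise multimodal KD objective (a sketch, not the paper's exact loss)."""
    t = temperature

    # 1) Last-layer (response) distillation with temperature-softened logits.
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # 2) Intermediate-layer distillation: match each student layer to its mapped
    #    teacher layer with an MSE on the hidden states.
    layer_loss = sum(
        F.mse_loss(student_hidden[s], teacher_hidden[t_idx])
        for s, t_idx in enumerate(layer_map)
    ) / len(layer_map)

    # 3) Auxiliary losses for the separated modalities: keep each single-modality
    #    prediction consistent with the student's own fused multimodal prediction.
    fused = F.softmax(student_logits.detach(), dim=-1)
    aux = (
        F.kl_div(F.log_softmax(student_vision_logits, dim=-1), fused, reduction="batchmean")
        + F.kl_div(F.log_softmax(student_text_logits, dim=-1), fused, reduction="batchmean")
    )

    return kd + alpha_layer * layer_loss + alpha_aux * aux
```

In this sketch, the choice of which teacher layers to match (via layer_map) and the loss weights are hyperparameters; the point is only to show how last-layer, intermediate-layer, and per-modality terms can be summed into one training objective.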
