Abstract

As an important field in information retrieval, fine-grained cross-modal retrieval has received great attention from researchers. Existing fine-grained cross-modal retrieval methods have made progress in capturing the fine-grained interplay between vision and language, but they fail to consider the fine-grained correspondences among features within the image latent space and within the text latent space, which may lead to inaccurate inference of intra-modal relations or false alignment of cross-modal information. Since object detection can provide fine-grained correspondences between image region features and their semantic features, this paper proposes a novel latent space semantic supervision model based on knowledge distillation (L3S-KD). For fine-grained alignment in the image latent space, L3S-KD trains classifiers supervised, via knowledge distillation, by the fine-grained correspondences obtained from an object detection model; for fine-grained alignment in the text latent space, the classifiers are supervised by object and attribute labels. Compared with existing fine-grained correspondence matching methods, L3S-KD learns more accurate semantic similarities for local fragments in image-text pairs. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate that L3S-KD consistently outperforms state-of-the-art image-text matching methods.
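The abstract does not give implementation details, so the following is only a minimal sketch of how the described supervision could be realized, assuming a PyTorch setup: on the image side, a classifier over region features is distilled from the soft class distribution of a frozen object detector (a standard KL-divergence distillation loss); on the text side, a classifier over word features is trained against hard object/attribute labels with cross-entropy. All names and hyperparameters (latent_dim, num_classes, temperature) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentSemanticSupervision(nn.Module):
    """Hypothetical sketch of L3S-KD-style latent space supervision.

    Image side: a classifier over region features is distilled from the
    soft class distribution of a pretrained object detector (KD loss).
    Text side: a classifier over word features is trained against hard
    object/attribute labels (cross-entropy loss).
    """

    def __init__(self, latent_dim: int, num_classes: int, temperature: float = 4.0):
        super().__init__()
        self.img_classifier = nn.Linear(latent_dim, num_classes)
        self.txt_classifier = nn.Linear(latent_dim, num_classes)
        self.T = temperature

    def image_kd_loss(self, region_feats: torch.Tensor,
                      detector_logits: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, latent_dim) from the image encoder
        # detector_logits: (num_regions, num_classes) teacher logits from
        # the object detection model (frozen, used only as soft targets)
        student = F.log_softmax(self.img_classifier(region_feats) / self.T, dim=-1)
        teacher = F.softmax(detector_logits / self.T, dim=-1)
        # KL divergence between teacher and student distributions,
        # scaled by T^2 as is standard in knowledge distillation
        return F.kl_div(student, teacher, reduction="batchmean") * self.T ** 2

    def text_label_loss(self, word_feats: torch.Tensor,
                        labels: torch.Tensor) -> torch.Tensor:
        # word_feats: (num_words, latent_dim) from the text encoder
        # labels: (num_words,) hard object/attribute class indices
        return F.cross_entropy(self.txt_classifier(word_feats), labels)
```

In such a setup, these two auxiliary losses would presumably be added to the main image-text matching objective, so that the latent features of both modalities are pulled toward a shared, fine-grained semantic label space before cross-modal similarity is computed.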
