Visual Semantic Embedding (VSE) is a dominant approach to cross-modal image–text retrieval. VSE aims to learn an embedding space in which images lie close to their corresponding captions. However, image–text data exhibit large intra-class variations: multiple captions of the same image are often written from different views, and descriptions from different views are often dissimilar. The VSE method embeds samples of the same class at similar positions, which suppresses intra-class variations and leads to inferior generalization. This paper proposes a Multi-View Visual Semantic Embedding (MV-VSE) framework that learns multiple embeddings for an image, explicitly modeling intra-class variations. To optimize the MV-VSE framework, a multi-view triplet loss is proposed, which jointly optimizes the multi-view embeddings while retaining intra-class variation. Recently, large-scale Vision-Language Pre-training (VLP) has become a new paradigm for cross-modal image–text retrieval. To allow our framework to be flexibly applied to both traditional VSE models and VSE-based VLP models, we combine the contrastive loss commonly used in VLP and the triplet loss into a unified loss, and further propose a multi-view unified loss. Our framework can be applied in a plug-and-play manner to traditional VSE models and VSE-based VLP models without excessively increasing model complexity. Experimental results on image–text retrieval benchmark datasets demonstrate that applying our framework boosts the retrieval performance of current VSE models. The code is available at https://github.com/AAA-Zheng/MV-VSE.
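The sketch below illustrates the general idea of a multi-view triplet loss as described in the abstract: each image keeps K view embeddings, the best-matching view is selected per image–caption pair, and a hinge loss is applied against the hardest in-batch negatives. This is a minimal PyTorch sketch under assumed conventions; the function name `multi_view_triplet_loss`, the view-aggregation by max similarity, and the `margin` value are illustrative assumptions, not the paper's exact formulation (see the released code for the authors' implementation).

```python
# Minimal sketch (assumptions noted above): K view embeddings per image,
# best-matching view per image-caption pair, hardest-negative hinge loss.
import torch
import torch.nn.functional as F


def multi_view_triplet_loss(img_views, cap_emb, margin=0.2):
    """
    img_views: (B, K, D) multi-view image embeddings
    cap_emb:   (B, D)    caption embeddings
    """
    B, K, D = img_views.shape
    img_views = F.normalize(img_views, dim=-1)
    cap_emb = F.normalize(cap_emb, dim=-1)

    # Cosine similarity between every image view and every caption: (B, K, B)
    sim = torch.einsum('bkd,cd->bkc', img_views, cap_emb)
    # Aggregate over views by keeping the best-matching view: (B, B)
    sim = sim.max(dim=1).values

    pos = sim.diag().view(B, 1)  # similarities of matched image-caption pairs
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)

    # Hinge costs for both retrieval directions, ignoring the diagonal
    cost_i2t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_t2i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)

    # Hardest negative per image (rows) and per caption (columns)
    return cost_i2t.max(dim=1).values.mean() + cost_t2i.max(dim=0).values.mean()
```

In this sketch, aggregating views with a max keeps the loss from forcing all K embeddings toward the same caption, which is one plausible way to retain intra-class variation while still optimizing all views jointly.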