Abstract
Image-text matching is an active research topic at the intersection of vision and language. The key to narrowing the "heterogeneity gap" between visual and textual data lies in learning powerful and robust representations for both modalities. This paper addresses fine-grained visual-textual alignment from two aspects: exploiting an attention mechanism to locate semantically meaningful portions, and leveraging a memory network to capture long-term contextual knowledge. Unlike most existing studies, which focus solely on exploring cross-modal associations at the fragment level, our Collaborative Dual Attention (CDA) module models semantic interdependencies from both the fragment and channel perspectives. Furthermore, since long-term contextual knowledge helps compensate for detailed semantics concealed in rarely appearing image-text pairs, we learn joint representations by constructing a Multi-Modal Memory Enhancement (M3E) module. Specifically, it sequentially stores intra-modal and multi-modal information into memory items, which in turn persistently memorize cross-modal shared semantics to improve the latent embeddings. By incorporating both the CDA and M3E modules into a deep architecture, our approach generates more semantically consistent embeddings for images and texts. Extensive experiments demonstrate that our model achieves state-of-the-art results on two public benchmark datasets.
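The CDA module is not detailed in this abstract; the PyTorch-style snippet below is a minimal, hypothetical sketch of the underlying idea only. The class name DualAttentionSketch, the single linear projection per branch, the scaling, and the residual fusion are illustrative assumptions rather than the paper's design; it merely shows how interdependencies could be modeled along both the fragment and the channel dimensions of a set of region or word features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionSketch(nn.Module):
    """Illustrative sketch: attend over fragments (regions/words) and channels.

    NOT the paper's CDA module; it only demonstrates the general pattern of
    modeling interdependencies along both the fragment and channel dimensions.
    """

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, feats):
        # feats: (batch, n_fragments, dim) — region features for an image
        # or word features for a sentence.
        q, k, v = self.query(feats), self.key(feats), self.value(feats)

        # Fragment-level attention: relate every fragment to every other fragment.
        frag_attn = F.softmax(q @ k.transpose(1, 2) / feats.size(-1) ** 0.5, dim=-1)
        frag_out = frag_attn @ v                                   # (batch, n_fragments, dim)

        # Channel-level attention: relate every feature channel to every other channel.
        chan_attn = F.softmax(feats.transpose(1, 2) @ feats, dim=-1)  # (batch, dim, dim)
        chan_out = feats @ chan_attn                               # (batch, n_fragments, dim)

        # Fuse the two attended views with the original features.
        return feats + frag_out + chan_out

# Example with illustrative sizes: 36 detected regions per image, 1024-d features.
feats = torch.randn(8, 36, 1024)
enhanced = DualAttentionSketch(1024)(feats)   # (8, 36, 1024)
```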
Highlights
In recent years, with the prevalence of deep learning in computer vision and natural language processing, vision and language understanding [1] has made tremendous progress. It encompasses a variety of downstream applications, such as image captioning [2], visual question answering (VQA) [3], visual grounding [4], and image-text matching [5], [6].
As a central task in this area, this paper focuses on the problem of image-text matching, which aims to retrieve the corresponding textual descriptions for a given image, or the corresponding images for a given textual description.
Summary
With the prevalence of deep learning in computer vision and natural language processing, vision and language understanding [1] has made tremendous progress. Many recent approaches employ attention mechanisms [2], [11] over local representations, aiming to focus on the semantically salient parts of images and texts and thereby capture the important information. Among these attention-based methods, the self-attention embedding network [52] exploits intra-modal attention to capture fragment dependencies among image patches according to their contextual correlations. We further present a Multi-Modal Memory Enhancement (M3E) module, which exploits long-term contextual knowledge to improve the joint embeddings. It is characterized by combining both intra-modal and multi-modal information to constitute the memory components, fully utilizing semantically complementary content to learn the cross-modal alignment. The structure of the text branch is similar to that of the image branch.
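The M3E module is likewise only described at a high level here. The following sketch (the class MemoryEnhancementSketch, the number of memory items, and the cosine-similarity addressing are assumptions for illustration, not the paper's formulation) shows a generic memory-enhancement pattern in which learnable memory items are read via attention and added back to a modality embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryEnhancementSketch(nn.Module):
    """Illustrative memory-augmented enhancement of modality embeddings.

    NOT the paper's M3E module; it only sketches the common pattern of keeping
    a bank of learnable memory items and re-expressing each input embedding as
    an attention-weighted read-out of those items.
    """

    def __init__(self, n_items=64, dim=1024):
        super().__init__()
        # Learnable memory items intended to accumulate shared semantics.
        self.memory = nn.Parameter(torch.randn(n_items, dim) * 0.02)

    def forward(self, feats):
        # feats: (batch, dim) global image or text embeddings.
        # Address the memory by cosine similarity between embeddings and items.
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.memory, dim=-1).t()
        weights = F.softmax(sim, dim=-1)          # (batch, n_items)
        read = weights @ self.memory              # memory read-out, (batch, dim)
        # Enhance the original embedding with the retrieved shared semantics.
        return F.normalize(feats + read, dim=-1)

# Example usage with illustrative sizes.
emb = torch.randn(8, 1024)
enhanced = MemoryEnhancementSketch(n_items=64, dim=1024)(emb)   # (8, 1024)
```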