Abstract

In image-sentence retrieval task, correlated images and sentences involve different levels of semantic relevance. However, existing multi-modal representation learning paradigms fail to capture the meaningful component relation on word and phrase level, while the attention-based methods still suffer from component-level mismatching and huge computation burden. We propose a Joint Global and Co-Attentive Representation learning method (JGCAR) for image-sentence retrieval. We formulate a global representation learning task which utilizes both intra-modal and inter-modal relative similarity to optimize the semantic consistency of the visual/textual component representations. We further develop a co-attention learning procedure to fully exploit different levels of visual-linguistic relations. We design a novel softmax-like bi-directional ranking loss to learn the co-attentive representation for image-sentence similarity computation. It is capable of discovering the correlative components and rectifying inappropriate component-level correlation to produce more accurate sentence-level ranking results. By joint global and co-attentive representation learning, the latter benefits from the former by producing more semantically consistent component representation, and the former also benefits from the latter by back-propagating the contextual information. Image-sentence retrieval is performed as a two-step process in the testing stage, inheriting advantages on both effectiveness and efficiency. Experiments show that JGCAR outperforms existing methods on MSCOCO and Flickr30K image-sentence retrieval tasks.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call