Abstract
Cross-modal information retrieval aims to retrieve relevant data of one modality given a query from another modality. The main challenge is to learn the semantic correlations between different modalities and to measure distances across modalities. For text-image retrieval, existing work mostly uses an off-the-shelf Convolutional Neural Network (CNN) for image feature extraction. For text, word-level features such as bag-of-words or word2vec are typically employed to build deep learning models. Besides word-level semantics, the semantic relations between words are also informative but less explored. In this paper, we explore inter-word semantics by modelling each text as a graph, with edges defined by a word2vec-based similarity measure. Besides feature representations, we further study the problem of information imbalance between modalities describing the same semantics; for example, textual descriptions often contain background information that cannot be conveyed by images, and vice versa. We propose a stacked co-attention network to progressively learn the mutually attended features of different modalities and enhance their fine-grained correlations. A dual-path neural network is proposed for cross-modal information retrieval. The model is trained with a pairwise similarity loss function that maximizes the similarity of relevant text-image pairs and minimizes the similarity of irrelevant pairs. Experimental results show that the proposed model significantly outperforms state-of-the-art methods, with up to a 19% improvement in accuracy in the best case.
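To make the graph-based text modelling concrete, the sketch below illustrates one way a word graph could be built from word2vec similarities, as the abstract describes. It is a minimal illustration, not the authors' implementation: the similarity threshold, the random stand-in embeddings, and the weighted-adjacency choice are all assumptions.

```python
# Minimal sketch: build a text graph whose nodes are words and whose edges
# connect word pairs with high word2vec cosine similarity.
# The embedding table `word_vectors` and the threshold 0.5 are hypothetical.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def build_text_graph(words, word_vectors, threshold=0.5):
    """Return a weighted adjacency matrix over `words`, linking pairs whose
    word2vec similarity exceeds `threshold`."""
    n = len(words)
    adj = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(word_vectors[words[i]], word_vectors[words[j]])
            if sim > threshold:
                adj[i, j] = adj[j, i] = sim  # undirected, similarity-weighted edge
    return adj

# Toy usage with random vectors standing in for pretrained word2vec embeddings.
rng = np.random.default_rng(0)
vocab = ["dog", "puppy", "car"]
word_vectors = {w: rng.standard_normal(300) for w in vocab}
print(build_text_graph(vocab, word_vectors))
```

Similarly, the pairwise similarity loss mentioned in the abstract could take a margin-based form like the sketch below. The margin value and the cosine scoring function are assumptions; the paper's exact formulation may differ.

```python
# Hedged sketch of a pairwise (triplet-style) similarity loss: a relevant
# text-image pair should score higher than an irrelevant pair by a margin.
import numpy as np

def pairwise_similarity_loss(text_vec, pos_img_vec, neg_img_vec, margin=0.2):
    """Hinge loss: penalize when an irrelevant image scores within `margin`
    of the relevant image for the same text query."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    pos_sim = cos(text_vec, pos_img_vec)  # similarity of the relevant pair
    neg_sim = cos(text_vec, neg_img_vec)  # similarity of an irrelevant pair
    return max(0.0, margin + neg_sim - pos_sim)
```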