Relationship-Embedded Representation Learning for Grounding Referring Expressions.

Sibei Yang, Guanbin Li, Yizhou Yu

doi:10.1109/tpami.2020.2973983

Abstract

Grounding referring expressions in images aims to locate the object instance in an image described by a referring expression. It involves a joint understanding of natural language and image content, and is essential for a range of visual tasks related to human-computer interaction. As a language-to-vision matching task, the core of this problem is to not only extract all the necessary information (i.e., objects and the relationships among them) in both the image and referring expression, but also make full use of context information to align cross-modal semantic concepts in the extracted information. Unfortunately, existing work on grounding referring expressions fails to accurately extract multi-order relationships from the referring expression and associate them with the objects and their related contexts in the image. In this paper, we propose a cross-modal relationship extractor (CMRE) to adaptively highlight objects and relationships (spatial and semantic relations) related to the given expression with a cross-modal attention mechanism, and represent the extracted information as a language-guided visual relation graph. In addition, we propose a Gated Graph Convolutional Network (GGCN) to compute multimodal semantic contexts by fusing information from different modes and propagating multimodal information in the structured relation graph. Experimental results on three common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of CMRE and GGCN, significantly surpasses all existing state-of-the-art methods.
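
As a rough illustration of the two ideas named in the abstract (attention that highlights expression-relevant objects, and gated propagation of information over a relation graph), the following is a minimal toy sketch. It is not the paper's CMRE or GGCN formulation: the function names, the dot-product scoring, the softmax gates, and the additive update rule are all assumptions chosen for brevity, and the feature vectors are made-up toy data.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_gates(expression_vec, object_feats):
    """Assumed scoring: dot product of the expression vector with each
    object feature, normalized by softmax into gates in (0, 1)."""
    return softmax([dot(expression_vec, f) for f in object_feats])


def gated_propagate(object_feats, edges, gates):
    """One assumed propagation step: every node adds the features of its
    incoming neighbors, each scaled by the neighbor's gate."""
    updated = [list(f) for f in object_feats]
    for src, dst in edges:  # directed edge src -> dst
        for k, value in enumerate(object_feats[src]):
            updated[dst][k] += gates[src] * value
    return updated


# Toy data (hypothetical): three objects with 2-d features, two relations.
expression_vec = [1.0, 0.0]
object_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
edges = [(0, 2), (1, 2)]

gates = attention_gates(expression_vec, object_feats)
contexts = gated_propagate(object_feats, edges, gates)
```

In this toy setup, object 0 aligns best with the expression vector and so receives the largest gate, which means its features dominate the context that object 2 accumulates from its neighbors. A real model would learn the scoring and update functions rather than fix them as done here.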

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Publication Date: Feb 16, 2020
Citations: 43

Similar Papers

Cross-Modal Relationship Inference for Grounding Referring Expressions
Sibei Yang ... Yizhou Yu
01 Jun 2019

Modeling Semantic and Behavioral Relations for Query Suggestion
Jimeng Chen ... Yalou Huang
01 Jan 2013

Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering
Yan Wang ... Peize Li
ACM Transactions on Multimedia Computing, Communications, and Applications | VOL. 20
23 Oct 2023

Image captioning based on deep reinforcement learning
Haichao Shi ... Peng Li
17 Aug 2018
