Abstract

• MCRN is a unified architecture for referring expression grounding.
• Local Node Attention highlights the appearance details of the target object.
• Multi-step reasoning enhances the interpretability of the reasoning process.
• Multi-step Graph Reasoning explores relationship context.

Referring expression grounding plays a fundamental role in vision-language understanding. It aims to locate the target region in an image that is described by a natural language expression. The task requires understanding high-level semantic correlations between objects in the image according to the referring expression, so it inherently requires reasoning over context information, i.e., appearance context and relationship context. However, most existing approaches either neglect the appearance details of the target region or rely on a manually designed reasoning structure and treat the context information of every neighboring object equally, which makes them inflexible when referring expressions are complicated. In this paper, we put forward the Multi-context Reasoning Network (MCRN) for the referring expression grounding task, which performs appearance context reasoning and relationship context reasoning simultaneously. For appearance context reasoning, we propose a local node attention that obtains a local representation of the target object and places more focus on its appearance details. For relationship context reasoning, we treat it as a language-guided multi-step reasoning problem and design a multi-step graph reasoning module that iteratively captures intra-context and inter-context between the target region and its intra-class and inter-class neighboring objects, which makes the reasoning process more reliable and interpretable. Extensive experiments on three popular benchmark datasets demonstrate the superiority of our method.
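
To make the idea of language-guided multi-step reasoning over neighboring objects more concrete, the following is a minimal sketch and not the MCRN implementation. The abstract gives no equations, so everything here is an assumption made for illustration: the dot-product scoring, the fixed 0.5 mixing rate, the use of one query vector per reasoning step, and all function names (`reasoning_step`, `multi_step_reasoning`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reasoning_step(target, neighbors, query):
    """One hypothetical reasoning step.

    Neighbors are weighted by their similarity to the language query
    (assumed dot-product scoring), so they are not treated equally.
    The weighted context is then mixed into the target feature with an
    assumed fixed rate of 0.5.
    """
    weights = softmax([dot(n, query) for n in neighbors])
    context = [
        sum(w * n[i] for w, n in zip(weights, neighbors))
        for i in range(len(target))
    ]
    updated = [0.5 * t + 0.5 * c for t, c in zip(target, context)]
    return updated, weights


def multi_step_reasoning(target, neighbors, queries):
    """Apply one reasoning step per query vector and keep the attention trace."""
    trace = []
    for query in queries:
        target, weights = reasoning_step(target, neighbors, query)
        trace.append(weights)
    return target, trace


if __name__ == "__main__":
    neighbors = [[1.0, 0.0], [0.0, 1.0]]  # two toy neighbor features
    queries = [[2.0, 0.0], [0.0, 2.0]]    # one toy query per step
    final, trace = multi_step_reasoning([0.0, 0.0], neighbors, queries)
    print(final)
    print(trace)
```

The per-step attention weights returned in `trace` show which neighbor each step relied on, which is one simple way an iterative scheme of this kind can be inspected.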

