Abstract

Visual relational reasoning is the basis of many vision-and-language tasks (e.g., visual question answering and referring expression comprehension). In this article, we take the complex referring expression comprehension (c-REF) task as the basis for studying such reasoning; c-REF seeks to localise a target object in an image guided by a complex query. Such queries often contain complex logic and thus pose two critical challenges for reasoning: (i) comprehending the complex queries is difficult, since they usually refer to multiple objects and their relationships; (ii) reasoning over multiple objects under the guidance of the queries and correctly localising the target is non-trivial. To address these challenges, we propose a Transformer-based Relational Inference Network (Trans-RINet). Specifically, to comprehend the queries, we mimic the way humans comprehend language and devise a language decomposition module that decomposes each query into four types of information, i.e., basic attributes, absolute location, visual relationship, and relative location. We further devise four modules, one to process each type of information. In each module, we consider both intra-modality (i.e., between objects) and inter-modality (i.e., between the query and objects) relationships to improve the reasoning ability. Moreover, we construct a relational graph to represent the objects and their relationships, and devise a multi-step reasoning method to progressively resolve the complex logic. Since the four types of information are closely related, we let the modules interact with one another before making the final decision. Extensive experiments on the CLEVR-Ref+, Ref-Reasoning, and CLEVR-CoGenT datasets demonstrate the superior reasoning performance of our Trans-RINet.
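The full architecture is detailed in the paper body; as a rough illustration only, the sketch below (in PyTorch, with hypothetical class names such as RelationalReasoningStep and MultiStepReasoner that are not taken from the paper) shows the general pattern of query-guided multi-step reasoning the abstract describes: each step combines inter-modality attention (query tokens to objects) with intra-modality attention (object to object), and the final layer scores each candidate object as the referent. It is not the authors' implementation.

```python
import torch
import torch.nn as nn


class RelationalReasoningStep(nn.Module):
    """One reasoning step: inter-modality attention (objects attend to the
    query) followed by intra-modality attention (objects attend to each
    other), in the spirit of reasoning over a relational graph of objects."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, obj_feats, query_feats):
        # obj_feats:   (batch, num_objects, dim)
        # query_feats: (batch, num_tokens, dim)
        x, _ = self.cross_attn(obj_feats, query_feats, query_feats)
        obj_feats = self.norm1(obj_feats + x)
        x, _ = self.self_attn(obj_feats, obj_feats, obj_feats)
        return self.norm2(obj_feats + x)


class MultiStepReasoner(nn.Module):
    """Stacks several reasoning steps and scores each object as the referent."""

    def __init__(self, dim, num_steps=3):
        super().__init__()
        self.steps = nn.ModuleList(
            RelationalReasoningStep(dim) for _ in range(num_steps)
        )
        self.scorer = nn.Linear(dim, 1)

    def forward(self, obj_feats, query_feats):
        for step in self.steps:
            obj_feats = step(obj_feats, query_feats)
        # One matching score per candidate object.
        return self.scorer(obj_feats).squeeze(-1)


if __name__ == "__main__":
    reasoner = MultiStepReasoner(dim=256)
    objects = torch.randn(2, 10, 256)   # 10 candidate objects per image
    query = torch.randn(2, 20, 256)     # 20 encoded query tokens
    scores = reasoner(objects, query)   # (2, 10) matching scores
    print(scores.shape)
```

In the paper's actual model, the decomposed query types (basic attributes, absolute location, visual relationship, relative location) would each drive a separate module of this kind, and the modules would exchange information before the final localisation decision.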
