Abstract

Visual relational reasoning is the basis of many vision-and-language tasks (e.g., visual question answering and referring expression comprehension). In this article, we take the complex referring expression comprehension (c-REF) task as the basis for studying such reasoning; c-REF seeks to localise a target object in an image guided by a complex query. Such queries often contain complex logic and thus pose two critical challenges for reasoning: (i) comprehending the complex queries is difficult, since they usually refer to multiple objects and their relationships; (ii) reasoning over multiple objects under the guidance of the queries and correctly localising the target is non-trivial. To address these challenges, we propose a Transformer-based Relational Inference Network (Trans-RINet). Specifically, to comprehend the queries, we mimic the way humans comprehend language and devise a language decomposition module that decomposes each query into four types of information, i.e., basic attributes, absolute location, visual relationship, and relative location. We further devise four modules, one to process each type of information. In each module, we consider both intra-modality (i.e., between objects) and inter-modality (i.e., between the query and objects) relationships to improve the reasoning ability. Moreover, we construct a relational graph to represent the objects and their relationships, and devise a multi-step reasoning method to progressively resolve the complex logic. Since the four types of information are closely related, we let the modules interact with one another before making the final decision. Extensive experiments on the CLEVR-Ref+, Ref-Reasoning, and CLEVR-CoGenT datasets demonstrate the superior reasoning performance of our Trans-RINet.
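The full architecture is detailed in the paper body; as a rough illustration only, the sketch below (in PyTorch, with hypothetical class names such as RelationalReasoningStep and MultiStepReasoner that are not taken from the paper) shows the general pattern of query-guided multi-step reasoning the abstract describes: each step combines inter-modality attention (query tokens to objects) with intra-modality attention (object to object), and the final layer scores each candidate object as the referent. It is not the authors' implementation.

```python
import torch
import torch.nn as nn


class RelationalReasoningStep(nn.Module):
    """One reasoning step: inter-modality attention (objects attend to the
    query) followed by intra-modality attention (objects attend to each
    other), in the spirit of reasoning over a relational graph of objects."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, obj_feats, query_feats):
        # obj_feats:   (batch, num_objects, dim)
        # query_feats: (batch, num_tokens, dim)
        x, _ = self.cross_attn(obj_feats, query_feats, query_feats)
        obj_feats = self.norm1(obj_feats + x)
        x, _ = self.self_attn(obj_feats, obj_feats, obj_feats)
        return self.norm2(obj_feats + x)


class MultiStepReasoner(nn.Module):
    """Stacks several reasoning steps and scores each object as the referent."""

    def __init__(self, dim, num_steps=3):
        super().__init__()
        self.steps = nn.ModuleList(
            RelationalReasoningStep(dim) for _ in range(num_steps)
        )
        self.scorer = nn.Linear(dim, 1)

    def forward(self, obj_feats, query_feats):
        for step in self.steps:
            obj_feats = step(obj_feats, query_feats)
        # One matching score per candidate object.
        return self.scorer(obj_feats).squeeze(-1)


if __name__ == "__main__":
    reasoner = MultiStepReasoner(dim=256)
    objects = torch.randn(2, 10, 256)   # 10 candidate objects per image
    query = torch.randn(2, 20, 256)     # 20 encoded query tokens
    scores = reasoner(objects, query)   # (2, 10) matching scores
    print(scores.shape)
```

In the paper's actual model, the decomposed query types (basic attributes, absolute location, visual relationship, relative location) would each drive a separate module of this kind, and the modules would exchange information before the final localisation decision.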
