Positional Attention Guided Transformer-Like Architecture for Visual Question Answering

Aihua Mao, Yong-Jin Liu, Zhi Yang, Ken Lin, Jun Xuan

doi:10.1109/tmm.2022.3216770

Abstract

Transformer architectures have recently been introduced into visual question answering (VQA) because of their powerful information extraction and fusion capabilities. However, existing Transformer-like models, including those built on a single Transformer structure and large-scale pre-trained generic visual-linguistic models, do not fully exploit the positional information of words in questions or of objects in images, both of which this paper shows to be crucial in VQA tasks. To address this limitation, we propose a novel positional attention guided Transformer-like architecture that adaptively extracts positional information within and across the visual and language modalities and uses it to guide high-level interactions in inter- and intra-modality information flows. In particular, we design three positional attention modules and assemble them into a single Transformer-like model, MCAN. We show that the positional information introduced in intra-modality interaction can adaptively modulate inter-modality interaction according to different inputs, which plays an important role in visual reasoning. Experimental results demonstrate that our model outperforms state-of-the-art models and is particularly good at handling object counting questions. Overall, our model achieves accuracies of 70.10%, 71.27%, and 71.52% on COCO-QA, VQA v1.0 test-std, and VQA v2.0 test-std, respectively. The source code will be publicly available at https://github.com/waizei/PositionalMCAN.
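
As a rough illustration of how positional information can guide attention within one modality, the sketch below adds a position-derived bias to ordinary scaled dot-product attention over image objects. It is not the paper's positional attention modules, whose exact formulation is not given in the abstract. The bias (negative scaled distance between bounding-box centres) and all function names are assumptions chosen for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def box_center(box):
    """Centre of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def positional_bias(boxes, scale=1.0):
    """Hypothetical bias: nearer objects get a larger (less negative) value."""
    centers = [box_center(b) for b in boxes]
    return [[-scale * math.dist(ci, cj) for cj in centers] for ci in centers]


def position_guided_attention(queries, keys, values, bias):
    """Scaled dot-product attention with an additive positional bias on the logits."""
    d = len(queries[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + bias[i][j]
            for j, k in enumerate(keys)
        ]
        w = softmax(logits)
        weights.append(w)
        # Weighted sum of value vectors, one output channel at a time.
        outputs.append(
            [sum(w[j] * values[j][c] for j in range(len(values)))
             for c in range(len(values[0]))]
        )
    return outputs, weights


# Three objects with identical features, so only position can change the weights.
features = [[1.0, 0.0] for _ in range(3)]
# Normalised boxes: objects 0 and 1 are neighbours, object 2 is far away.
boxes = [(0.0, 0.0, 0.2, 0.2), (0.1, 0.0, 0.3, 0.2), (0.8, 0.8, 1.0, 1.0)]

attended, weights = position_guided_attention(
    features, features, features, positional_bias(boxes)
)
```

Because the three feature vectors are identical, the content term contributes the same logit everywhere, and object 0 attends more strongly to its neighbour (object 1) than to the distant object 2 purely because of the bias. A learned, input-dependent bias would be the natural next step, but that goes beyond what the abstract specifies.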

Journal: IEEE Transactions on Multimedia
Publication Date: Jan 1, 2023
Citations: 4

