Dual self-attention with co-attention networks for visual question answering

Yun Liu, Xiaoming Zhang, Qianyun Zhang, Chaozhuo Li, Feiran Huang, Xianghong Tang, Zhoujun Li

doi:10.1016/j.patcog.2021.107956

Abstract

Visual Question Answering (VQA), an important task in vision-and-language understanding, has attracted wide interest. In previous VQA methods, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are generally used to extract visual and textual features respectively, and the correlation between these two kinds of features is then explored to infer the answer. However, CNNs mainly focus on extracting local spatial information, while RNNs pay more attention to exploiting sequential structure and long-range dependencies. It is therefore difficult for either to integrate local features with their global dependencies and learn more effective representations of the image and question. To address this problem, we propose a novel model, Dual Self-Attention with Co-Attention networks (DSACA), for VQA. It models the internal dependencies of both the spatial structure of the image and the sequential structure of the question by using the newly proposed self-attention mechanism. Specifically, DSACA contains three submodules. The visual self-attention module selectively aggregates the visual features at each region by a weighted sum of the features at all positions. The textual self-attention module automatically emphasizes interdependent word features by integrating associated features among the words of the sentence. In addition, the visual-textual co-attention module explores the close correlation between the visual and textual features learned by the self-attention modules. The three modules are integrated into an end-to-end framework to infer the answer. Extensive experiments on three widely used VQA datasets confirm the favorable performance of DSACA compared with state-of-the-art methods.
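To make the "weighted sum of the features at all positions" concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how the attention weights are computed, so scaled dot-product similarity followed by a softmax is assumed here, and the function names (`attend`, `self_attention`, `co_attention`) and the toy feature vectors are hypothetical. Learned projections, fusion, and the answer classifier are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(queries, context):
    """For each query vector, return a softmax-weighted sum of the context vectors.

    Assumption: weights come from scaled dot-product similarity.
    """
    dim = len(context[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        outputs.append(
            [sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)]
        )
    return outputs


def self_attention(features):
    """Each position aggregates the features at all positions of the same input."""
    return attend(features, features)


def co_attention(visual, textual):
    """Cross-modal attention in both directions (hypothetical simplification)."""
    visual_per_word = attend(textual, visual)    # each word attends over regions
    textual_per_region = attend(visual, textual)  # each region attends over words
    return visual_per_word, textual_per_region


# Toy inputs: 3 image regions and 2 question words, each a 2-d feature vector.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[0.5, 0.5], [1.0, 0.0]]

v = self_attention(regions)  # visual self-attention
t = self_attention(words)    # textual self-attention
visual_per_word, textual_per_region = co_attention(v, t)
```

Because every output row is a convex combination of the input rows, each attended feature stays within the range spanned by the original features; a trained model would additionally apply learned query, key, and value projections before computing the weights.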

Journal: Pattern Recognition
Publication Date: Apr 9, 2021
Citations: 41

Similar Papers

Visual Question Answering using Convolutional Neural Networks
K P Moholkar, et al.
Turkish Journal of Computer and Mathematics Education (TURCOMAT) | VOL. 12
11 Apr 2021

Learning Convolutional Text Representations for Visual Question Answering
Zhengyang Wang ... Shuiwang Ji
07 May 2018

Image Captioning and Visual Question Answering Based on Attributes and External Knowledge
Qi Wu ... Chunhua Shen
IEEE Transactions on Pattern Analysis and Machine Intelligence | VOL. 40
26 May 2017

Convolutional Neural Networks-Based VQA Model
Himanshu Sharma ... Anand Singh Jalal
28 Jun 2022
