Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering

Xiangpeng Li, Lianli Gao, Jingkuan Song, Xianglong Liu, Xiangnan He, Wenbing Huang, Chuang Gan

doi:10.1609/aaai.v33i01.33018658

Abstract

Most recent progress in visual question answering is based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often time-consuming and have difficulty modeling long-range dependencies because of the sequential nature of RNNs. We propose a new architecture, Positional Self-Attention with Co-Attention (PSAC), which does not require RNNs for video question answering. Specifically, inspired by the success of self-attention in machine translation, we propose a Positional Self-Attention that calculates the response at each position by attending to all positions within the same sequence and then adds representations of absolute positions. PSAC can therefore exploit the global dependencies of the question and the temporal information in the video, and can execute question encoding and video encoding in parallel. Furthermore, in addition to attending to the video features relevant to the given question (i.e., video attention), we use a co-attention mechanism that simultaneously models “what words to listen to” (question attention). To the best of our knowledge, this is the first work to replace RNNs with self-attention for the task of visual question answering. Experimental results on four tasks of the benchmark dataset show that our model significantly outperforms the state of the art on three tasks and attains a comparable result on the Count task. Our model requires less computation time and achieves better performance than RNN-based methods. An additional ablation study demonstrates the effect of each component of the proposed model.
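The positional self-attention step described in the abstract (attend to every position in the same sequence, then add a representation of the absolute position) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned query/key/value projections and a single head, and it assumes a sinusoidal encoding for absolute positions. The names `positional_self_attention`, `sinusoidal_position`, and the toy `frames` input are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sinusoidal_position(pos, dim):
    # Assumed absolute-position encoding: sin on even indices, cos on odd ones.
    return [
        math.sin(pos / 10000 ** (2 * (i // 2) / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** (2 * (i // 2) / dim))
        for i in range(dim)
    ]

def positional_self_attention(seq):
    # seq: list of feature vectors (one per frame or word), all of equal length.
    dim = len(seq[0])
    scale = math.sqrt(dim)
    out, all_weights = [], []
    for pos, query in enumerate(seq):
        # Score this position against every position in the same sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in seq]
        weights = softmax(scores)
        # Weighted sum of all positions gives the response at this position.
        attended = [sum(w * v[d] for w, v in zip(weights, seq)) for d in range(dim)]
        # Add the absolute-position representation after attending.
        pe = sinusoidal_position(pos, dim)
        out.append([a + p for a, p in zip(attended, pe)])
        all_weights.append(weights)
    return out, all_weights

# Toy input: 6 "frames" with 4-dimensional features (made-up values).
frames = [[0.1 * (t + d) for d in range(4)] for t in range(6)]
encoded, weights = positional_self_attention(frames)
```

Each position's response depends only on the input sequence, not on the response at the previous position, so the loop over positions has no sequential dependency and could be computed in parallel, which is the property the abstract contrasts with RNNs.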

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Jul 17, 2019
Citations: 197

Similar Papers

Improving Automatic VQA Evaluation Using Large Language Models
Oscar Mañas ... Aishwarya Agrawal
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 38
24 Mar 2024

Visual Question Generation as Dual Task of Visual Question Answering
Yikang Li ... Nan Duan
01 Jun 2018

Convolutional Neural Networks-Based VQA Model
Himanshu Sharma ... Anand Singh Jalal
28 Jun 2022

Coarse-to-Fine Reasoning for Visual Question Answering
Binh X Nguyen ... Anh Nguyen
01 Jun 2022
