Spatiotemporal-Textual Co-Attention Network for Video Question Answering

Zheng-Jun Zha, Jiawei Liu, Yongdong Zhang, Tianhao Yang

doi:10.1145/3320061

Abstract

Visual Question Answering (VQA) aims to provide a natural language answer given an image or video paired with a natural language question. Despite recent progress on VQA, existing works primarily focus on image question answering and are suboptimal for video question answering. This article presents a novel Spatiotemporal-Textual Co-Attention Network (STCA-Net) for video question answering. The STCA-Net jointly learns spatial and temporal visual attention on videos as well as textual attention on questions. It concentrates on the essential cues in both the visual and textual spaces for answering the question, leading to an effective question-video representation. In particular, a question-guided attention network is designed to learn a question-aware video representation with a spatial-temporal attention module, which concentrates the network on regions of interest within the frames of interest across the entire video. A video-guided attention network is proposed to learn a video-aware question representation with a textual attention module, leading to a fine-grained understanding of the question. The learned video and question representations are used by an answer predictor to generate answers. Extensive experiments on two challenging video question answering datasets, MSVD-QA and MSRVTT-QA, show the effectiveness of the proposed approach.
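
The co-attention flow described in the abstract can be illustrated with a minimal sketch. This is not the STCA-Net implementation: the abstract does not specify the scoring functions, so plain dot-product attention is assumed here, along with a mean-pooled question vector as the initial guide. All names (`attend`, `co_attention`, the toy feature values) are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(items, query):
    # Assumed dot-product attention: score each item against the query,
    # normalize with softmax, and return the weighted sum plus the weights.
    weights = softmax([dot(item, query) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[i] for w, item in zip(weights, items)) for i in range(dim)]
    return pooled, weights

def co_attention(video, words):
    # video: list of frames, each a list of region feature vectors.
    # words: list of word feature vectors for the question.
    dim = len(words[0])
    # Assumption: a mean-pooled question vector guides the visual attention.
    q_mean = [sum(w[i] for w in words) / len(words) for i in range(dim)]
    # Spatial attention: pool the regions of each frame, guided by the question.
    frame_feats = [attend(regions, q_mean)[0] for regions in video]
    # Temporal attention: pool the frames, guided by the question.
    video_rep, frame_weights = attend(frame_feats, q_mean)
    # Textual attention: pool the question words, guided by the video.
    question_rep, word_weights = attend(words, video_rep)
    return video_rep, question_rep, frame_weights, word_weights

# Toy input: 2 frames with 2 regions each, 3 question words, 2-d features.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 0.0]],
]
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_rep, question_rep, frame_weights, word_weights = co_attention(video, words)
```

In a trained model, the two pooled representations would be passed to an answer predictor; the sketch stops before that step because the abstract gives no detail on the predictor.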

Journal: ACM Transactions on Multimedia Computing, Communications, and Applications
Publication Date: Apr 30, 2019
Citations: 40

Similar Papers

Video Question Answering via Knowledge-based Progressive Spatial-Temporal Attention Network
ACM Transactions on Multimedia Computing, Communications, and Applications | VOL. 15
30 Apr 2019

Multimodal feature fusion by relational reasoning and attention for visual question answering
Weifeng Zhang ... Zengchang Qin
Information Fusion | VOL. 55
19 Aug 2019

Video question answering by frame attention
Jiannan Fang ... Lingling Sun
14 Aug 2019

Visual Question Answering as Reading Comprehension
Hui Li ... Anton Van Den Hengel
01 Jun 2019
