Video Question Answering via Knowledge-based Progressive Spatial-Temporal Attention Network

Weike Jin, Jie Li, Yimeng Li, Zhou Zhao, Jun Xiao, Yueting Zhuang

doi:10.1145/3321505

Abstract

Visual Question Answering (VQA) is a challenging task that has gained increasing attention from both the computer vision and the natural language processing communities in recent years. Given a question in natural language, a VQA system is designed to automatically generate the answer according to the referenced visual content. Although there has recently been much interest in this topic, existing work on visual question answering mainly focuses on a single static image, which is only a small part of the dynamic and sequential visual data in the real world. Video question answering (VideoQA), a natural extension, is less explored. Because of the inherent temporal structure of video, ImageQA approaches may not transfer effectively to video question answering. In this article, we not only take the spatial and temporal dimensions of video content into account but also employ an external knowledge base to improve the answering ability of the network. More specifically, we propose a knowledge-based progressive spatial-temporal attention network to tackle this problem. We obtain both object and region features of the video frames from a region proposal network. The knowledge representation is generated by a word-level attention mechanism over the comment information of each object, which is extracted from DBpedia. Then, we develop a question-knowledge-guided progressive spatial-temporal attention network to learn the joint video representation for the video question answering task. We also construct a large-scale video question answering dataset. Extensive experiments on two different datasets validate the effectiveness of our method.
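
The abstract names three attention stages (word-level attention over DBpedia comment text, then spatial attention over frame regions, then temporal attention over frames) but gives no equations. The following is only a toy sketch of how such a pipeline could be chained, not the authors' model: the dot-product scoring, the element-wise sum used to combine question and knowledge vectors, and the names `attend` and `answer_context` are all assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, items):
    """Soft attention: pool item vectors with weights softmax(query . item).

    Dot-product scoring is an assumption; the abstract does not say
    which scoring function the paper uses.
    """
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[d] for w, item in zip(weights, items))
              for d in range(dim)]
    return pooled, weights

def answer_context(question, comment_words, video):
    """Hypothetical three-stage pipeline.

    question:      question embedding (list of floats)
    comment_words: word embeddings of an object's knowledge-base comment
    video:         list of frames, each a list of region feature vectors
    """
    # 1. Word-level attention over the comment words -> knowledge vector.
    knowledge, _ = attend(question, comment_words)
    # 2. Combine question and knowledge into one guide vector
    #    (element-wise sum is an assumption).
    guide = [q + k for q, k in zip(question, knowledge)]
    # 3. Spatial attention: pool the regions inside each frame.
    frame_vecs = [attend(guide, regions)[0] for regions in video]
    # 4. Temporal attention: pool the frame vectors across time.
    video_vec, frame_weights = attend(guide, frame_vecs)
    return video_vec, frame_weights

# Toy 2-dimensional inputs: two comment words, two frames of two regions each.
question = [1.0, 0.0]
comment_words = [[1.0, 0.0], [0.0, 1.0]]
video = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0 has a region aligned with the question
    [[0.0, 1.0], [0.0, 1.0]],  # frame 1 has none
]
video_vec, frame_weights = answer_context(question, comment_words, video)
```

In this toy input, frame 0 receives the larger temporal weight because one of its regions aligns with the question vector. A real system would use learned projections and recurrent or convolutional encoders in place of these raw dot products.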

Journal: ACM Transactions on Multimedia Computing, Communications, and Applications
Publication Date: Apr 30, 2019
Citations: 13

Similar Papers

Spatiotemporal-Textual Co-Attention Network for Video Question Answering
Zheng-Jun Zha ... Jiawei Liu
ACM Transactions on Multimedia Computing, Communications, and Applications | VOL. 15
30 Apr 2019

Visual Question Answering: Methodologies and Challenges
Liyana Sahir Kallooriyakath ... Bindu P V
09 Oct 2020

VQAR: Review on Information Retrieval Techniques based on Computer Vision and Natural Language Processing
Shivangi Modi ... Dhatri Pandya
01 Mar 2019

Visual Question Answering Using Deep Learning: A Survey and Performance Analysis
Yash Srivastava ... Shiv Ram Dubey
01 Jan 2020
