Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering

Shaoning Xiao, Yimeng Li, Yunan Ye, Zhou Zhao, Long Chen, Jian Shao, Shiliang Pu, Jun Xiao

doi:10.1007/s11063-019-10003-1

Abstract

This work addresses the problem of video question answering (VideoQA) with a novel model and a new open-ended VideoQA dataset. VideoQA is a challenging task in visual information retrieval that aims to generate an answer from the video content and the question. Ultimately, VideoQA is a video understanding task, and efficiently combining multi-grained representations is the key factor in understanding a video. Existing works mostly rely on overall frame-level visual understanding, which neglects finer-grained and temporal information inside the video, or they combine the multi-grained representations simply by concatenation or addition. We therefore propose the multi-granularity temporal attention network, which searches for the specific frames in a video that are holistically and locally related to the answer. We first learn the mutual attention representations of the multi-grained visual content and the question. The mutually attended features are then combined hierarchically using a double-layer LSTM to generate the answer. Furthermore, we compare several different multi-grained fusion configurations to demonstrate the advantage of this hierarchical architecture. The effectiveness of our model is demonstrated on a large-scale video question answering dataset based on the ActivityNet dataset.
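
The attention step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple scaled dot-product score, shows only the question-to-video direction of the mutual attention, and omits the double-layer LSTM fusion entirely. The names `temporal_attention`, `frame_feats`, and `region_feats`, and all feature values, are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(question, frames):
    """Question-guided attention over time (assumed scaled dot-product form).

    question: list of d floats (hypothetical question embedding)
    frames:   list of T feature vectors, each with d floats
    Returns (weights over the T time steps, attended feature of length d).
    """
    d = len(question)
    # One relevance score per time step.
    scores = [
        sum(q * f for q, f in zip(question, frame)) / math.sqrt(d)
        for frame in frames
    ]
    weights = softmax(scores)
    # Weighted sum of the per-step features.
    attended = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(d)
    ]
    return weights, attended


# Toy inputs: the same 3 time steps described at two granularities.
question = [1.0, 0.0]
frame_feats = [[0.1, 0.9], [2.0, 0.1], [0.2, 0.3]]   # coarse, whole-frame features
region_feats = [[0.0, 1.0], [0.5, 0.5], [3.0, 0.0]]  # fine, region-level features

frame_w, frame_ctx = temporal_attention(question, frame_feats)
region_w, region_ctx = temporal_attention(question, region_feats)

# The two granularities can favor different time steps for the same question.
print("most relevant step (coarse):", frame_w.index(max(frame_w)))
print("most relevant step (fine):", region_w.index(max(region_w)))
```

In this toy setup each granularity yields its own attention distribution over time, which is why a fusion stage that keeps the temporal order, rather than plain concatenation or addition, is the design question the abstract raises.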

Journal: Neural Processing Letters
Publication Date: Feb 13, 2019
Citations: 8

Similar Papers

Video question answering via multi-granularity temporal attention network learning
Shaoning Xiao ... Zhou Zhao
17 Aug 2018

Open-Ended Video Question Answering via Multi-Modal Conditional Adversarial Networks
Zhou Zhao ... Chujie Lu
IEEE Transactions on Image Processing | VOL. 29
01 Jan 2020

Video question answering via grounded cross-attention network learning
Yunan Ye ... Jun Xiao
Information Processing & Management | VOL. 57
16 Apr 2020

Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks
Zhou Zhao ... Yueting Zhuang
01 Jul 2018
