Uncovering the Temporal Context for Video Question Answering

Linchao Zhu,Yi Yang,Zhongwen Xu,Alexander G Hauptmann

doi:10.1007/s11263-017-1033-7

Abstract

In this work, we introduce Video Question Answering in the temporal domain to infer the past, describe the present and predict the future. We present an encoder---decoder approach using Recurrent Neural Networks to learn the temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore approaches for finer understanding of video content using the question form of fill-in-the-blank, and collect our Video Context QA dataset consisting of 109,895 video clips with a total duration of more than 1000 h from existing TACoS, MPII-MD and MEDTest 14 datasets. In addition, 390,744 corresponding questions are generated from annotations. Extensive experiments demonstrate that our approach significantly outperforms the compared baselines.

Full Text