Abstract
In this work, we introduce Video Question Answering in the temporal domain to infer the past, describe the present and predict the future. We present an encoder---decoder approach using Recurrent Neural Networks to learn the temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore approaches for finer understanding of video content using the question form of fill-in-the-blank, and collect our Video Context QA dataset consisting of 109,895 video clips with a total duration of more than 1000 h from existing TACoS, MPII-MD and MEDTest 14 datasets. In addition, 390,744 corresponding questions are generated from annotations. Extensive experiments demonstrate that our approach significantly outperforms the compared baselines.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have