Abstract

Video question answering is an emerging and challenging task in visual information retrieval (VIR), which automatically generates an answer to a question based on the referenced video content. Unlike existing visual question answering methods, which mainly focus on static image content, video question answering must take the temporal dimension into account because of the essential structural difference between images and videos. In this paper, we study the problem of video question answering from the viewpoint of grounded cross-attention network learning. Specifically, we propose a novel hierarchical cross-attention mechanism with mutual attention learning for video question answering, named GCANet. We first obtain a multi-level rough video representation from frame-level and clip-level video features. Then, we employ a region proposal network to generate object-level grounded video features as the grounded video representation. Next, the grounded question-video representation is learned by the first layer of the GCANet framework, the Q−O cross-attention layer. The second layer, the Q−V−H cross-attention layer, then learns the joint question-video representation from both the rough and grounded video representations for video question answering. We construct two large-scale video question answering datasets, and experimental results on them demonstrate the effectiveness of our model.
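To make the two-stage design concrete, the sketch below shows one plausible reading of it in PyTorch: a generic scaled dot-product cross-attention module, applied first between the question and object-level grounded features (Q−O), and then between the grounded output and the frame-level and clip-level rough features (Q−V−H). This is a minimal illustration only; the module names (`CrossAttention`, `GCANetSketch`), feature dimensions, and the mean-pool-and-concatenate fusion are hypothetical placeholders, and the paper's exact attention formulation and answer decoder are not specified here.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Scaled dot-product cross-attention: queries attend over a context set."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query, context):
        # query:   (batch, n_q, dim), e.g. question token features
        # context: (batch, n_c, dim), e.g. object / frame / clip features
        q = self.q_proj(query)
        k = self.k_proj(context)
        v = self.v_proj(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # (batch, n_q, dim)

class GCANetSketch(nn.Module):
    """Hypothetical two-stage pipeline: Q-O attention over grounded object
    features, then Q-V-H attention over rough frame- and clip-level features."""
    def __init__(self, dim, num_answers):
        super().__init__()
        self.q_o = CrossAttention(dim)  # question attends over object proposals
        self.q_v = CrossAttention(dim)  # grounded repr. attends over frames
        self.q_h = CrossAttention(dim)  # grounded repr. attends over clips
        self.classifier = nn.Linear(3 * dim, num_answers)

    def forward(self, question, objects, frames, clips):
        grounded = self.q_o(question, objects)   # grounded question-video repr.
        rough_v = self.q_v(grounded, frames)     # frame-level rough repr.
        rough_h = self.q_h(grounded, clips)      # clip-level rough repr.
        joint = torch.cat(
            [grounded.mean(1), rough_v.mean(1), rough_h.mean(1)], dim=-1)
        return self.classifier(joint)            # answer scores

# Example usage with random features (all shapes are illustrative):
model = GCANetSketch(dim=512, num_answers=1000)
q = torch.randn(2, 12, 512)   # 12 question tokens
o = torch.randn(2, 36, 512)   # 36 object proposals per video
f = torch.randn(2, 20, 512)   # 20 sampled frames
c = torch.randn(2, 8, 512)    # 8 clips
scores = model(q, o, f, c)    # (2, 1000)
```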
