Abstract

Video question answering is an emerging and challenging task in visual information retrieval (VIR), which automatically generates an answer to a question based on the referenced video content. Unlike existing visual question answering methods, which mainly focus on static image content, video question answering must take the temporal dimension into account because of the essential structural difference between images and videos. In this paper, we study the problem of video question answering from the viewpoint of grounded cross-attention network learning. Specifically, we propose a novel hierarchical cross-attention mechanism with mutual attention learning for video question answering, named GCANet. We first obtain a multi-level rough video representation from frame-level and clip-level video features. Then, we employ a region proposal network to generate object-level grounded video features as the grounded video representation. Next, the grounded question-video representation is learned by the first layer of the GCANet framework, the Q−O cross-attention layer. The second layer, the Q−V−H cross-attention layer, then learns the joint question-video representation from both the rough and grounded video representations for video question answering. We construct two large-scale video question answering datasets, and experimental results on them demonstrate the effectiveness of our model.
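To make the two-stage design concrete, the sketch below shows one plausible reading of it in PyTorch: a generic scaled dot-product cross-attention module, applied first between the question and object-level grounded features (Q−O), and then between the grounded output and the frame-level and clip-level rough features (Q−V−H). This is a minimal illustration only; the module names (`CrossAttention`, `GCANetSketch`), feature dimensions, and the mean-pool-and-concatenate fusion are hypothetical placeholders, and the paper's exact attention formulation and answer decoder are not specified here.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Scaled dot-product cross-attention: queries attend over a context set."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query, context):
        # query:   (batch, n_q, dim), e.g. question token features
        # context: (batch, n_c, dim), e.g. object / frame / clip features
        q = self.q_proj(query)
        k = self.k_proj(context)
        v = self.v_proj(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # (batch, n_q, dim)

class GCANetSketch(nn.Module):
    """Hypothetical two-stage pipeline: Q-O attention over grounded object
    features, then Q-V-H attention over rough frame- and clip-level features."""
    def __init__(self, dim, num_answers):
        super().__init__()
        self.q_o = CrossAttention(dim)  # question attends over object proposals
        self.q_v = CrossAttention(dim)  # grounded repr. attends over frames
        self.q_h = CrossAttention(dim)  # grounded repr. attends over clips
        self.classifier = nn.Linear(3 * dim, num_answers)

    def forward(self, question, objects, frames, clips):
        grounded = self.q_o(question, objects)   # grounded question-video repr.
        rough_v = self.q_v(grounded, frames)     # frame-level rough repr.
        rough_h = self.q_h(grounded, clips)      # clip-level rough repr.
        joint = torch.cat(
            [grounded.mean(1), rough_v.mean(1), rough_h.mean(1)], dim=-1)
        return self.classifier(joint)            # answer scores

# Example usage with random features (all shapes are illustrative):
model = GCANetSketch(dim=512, num_answers=1000)
q = torch.randn(2, 12, 512)   # 12 question tokens
o = torch.randn(2, 36, 512)   # 36 object proposals per video
f = torch.randn(2, 20, 512)   # 20 sampled frames
c = torch.randn(2, 8, 512)    # 8 clips
scores = model(q, o, f, c)    # (2, 1000)
```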
