Abstract

Answering long questions about complex videos is extremely challenging: video question answering must not only capture clues from the question but also infer which clip or frame of the video is relevant. In this paper, we propose a novel method that understands the entire video and infers the answer to the question. We combine a traditional attention mechanism with a multi-head structure to construct a differentiated attention module. Unlike existing methods, ours is dedicated to obtaining differentiated features. In our method, a video is split into several clips with large overlaps between them, so simply using a self-attention mechanism to aggregate features leads to excessive redundancy in the captured features. To tackle this issue, the proposed differentiated attention module, consisting of a traditional attention mechanism and a multi-head structure, focuses on the core semantics and decodes different clips or phrases. We also apply the differentiated attention block to question aggregation and video-clue reasoning, and introduce a different-query attention loss (DQALoss) to address the stronger differentiation that questions require. Meanwhile, we utilize multi-modal factorized bilinear pooling for multi-modal feature reasoning and interaction. Experiments show that the proposed method outperforms existing methods on the TGIF-QA dataset by large margins, demonstrating its effectiveness.
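To make the fusion step concrete, the multi-modal factorized bilinear (MFB) pooling mentioned above can be sketched as follows. This is a minimal illustration of the general MFB technique, not the paper's implementation: the feature dimensions, factor size `k`, and the power/L2 normalization commonly paired with MFB are all assumptions for the example.

```python
import numpy as np

def mfb_pool(x, y, U, V, k):
    """Sketch of multi-modal factorized bilinear pooling.

    x, y : feature vectors from two modalities (e.g. video and question)
    U, V : projection matrices mapping each modality to d_out * k dims
    k    : number of factors summed per output dimension
    """
    joint = (x @ U) * (y @ V)             # project, then elementwise multiply
    z = joint.reshape(-1, k).sum(axis=1)  # sum-pool every group of k factors
    # signed square-root ("power") normalization, then L2 normalization,
    # as is common in MFB-style fusion (an assumption here, not from the paper)
    z = np.sign(z) * np.sqrt(np.abs(z))
    return z / (np.linalg.norm(z) + 1e-12)

# Illustrative usage with made-up dimensions:
rng = np.random.default_rng(0)
x = rng.normal(size=128)                  # e.g. a video-clip feature
y = rng.normal(size=64)                   # e.g. a question feature
U = rng.normal(size=(128, 32 * 5))        # d_out = 32, k = 5
V = rng.normal(size=(64, 32 * 5))
z = mfb_pool(x, y, U, V, k=5)             # fused vector of length 32
```

The factorized form keeps the expressiveness of a bilinear interaction between the two modalities while avoiding the full `d_x * d_y * d_out` parameter tensor.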
