Abstract

Answering long questions about complex videos is extremely challenging: video question answering must not only capture clues from the question but also infer which clip or frame of the video is relevant. In this paper, we propose a novel method that understands the entire video and infers the answer to the question. We combine a traditional attention mechanism with a multi-head structure to construct a differentiated attention module. Unlike existing methods, ours is dedicated to obtaining differentiated features. In our method, a video is split into several clips with large overlaps between them, so simply using a self-attention mechanism to aggregate features leads to excessive redundancy in the captured features. To tackle this issue, the proposed differentiated attention module, consisting of a traditional attention mechanism and a multi-head structure, focuses on the core semantics and decodes different clips or phrases. We also apply the differentiated attention block to question aggregation and video-clue reasoning, and introduce a different-query attention loss (DQALoss) to address the stronger differentiation that questions require. Meanwhile, we utilize multi-modal factorized bilinear pooling for multi-modal feature reasoning and interaction. Experiments show that the proposed method outperforms existing methods on the TGIF-QA dataset by large margins, demonstrating its effectiveness.
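To make the fusion step concrete, the multi-modal factorized bilinear (MFB) pooling mentioned above can be sketched as follows. This is a minimal illustration of the general MFB technique, not the paper's implementation: the feature dimensions, factor size `k`, and the power/L2 normalization commonly paired with MFB are all assumptions for the example.

```python
import numpy as np

def mfb_pool(x, y, U, V, k):
    """Sketch of multi-modal factorized bilinear pooling.

    x, y : feature vectors from two modalities (e.g. video and question)
    U, V : projection matrices mapping each modality to d_out * k dims
    k    : number of factors summed per output dimension
    """
    joint = (x @ U) * (y @ V)             # project, then elementwise multiply
    z = joint.reshape(-1, k).sum(axis=1)  # sum-pool every group of k factors
    # signed square-root ("power") normalization, then L2 normalization,
    # as is common in MFB-style fusion (an assumption here, not from the paper)
    z = np.sign(z) * np.sqrt(np.abs(z))
    return z / (np.linalg.norm(z) + 1e-12)

# Illustrative usage with made-up dimensions:
rng = np.random.default_rng(0)
x = rng.normal(size=128)                  # e.g. a video-clip feature
y = rng.normal(size=64)                   # e.g. a question feature
U = rng.normal(size=(128, 32 * 5))        # d_out = 32, k = 5
V = rng.normal(size=(64, 32 * 5))
z = mfb_pool(x, y, U, V, k=5)             # fused vector of length 32
```

The factorized form keeps the expressiveness of a bilinear interaction between the two modalities while avoiding the full `d_x * d_y * d_out` parameter tensor.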
