Video Question Answering (VideoQA), which aims to answer a given question by understanding multimodal video content, is challenging due to the richness of that content. From the perspective of video understanding, a complete VideoQA framework needs to understand the video content at different semantic levels and flexibly integrate the diverse content to distill question-related information. To this end, we propose a Lightweight Visual-Linguistic Reasoning framework named LiVLR. Specifically, LiVLR first employs graph-based visual and linguistic encoders to obtain multi-grained visual and linguistic representations, respectively. The obtained representations are then integrated by the devised Diversity-aware Visual-Linguistic Reasoning module (DaVL). DaVL distinguishes the different types of representations via a learnable index embedding during graph embedding, and can therefore flexibly adjust the importance of each representation type when generating the question-related joint representation. The proposed LiVLR is lightweight and shows performance advantages on three VideoQA benchmarks: MSRVTT-QA, KnowIT VQA, and TVQA. Extensive ablation studies demonstrate the effectiveness of the key components of LiVLR.
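The abstract does not give implementation details, but the index-embedding idea in DaVL can be illustrated with a minimal sketch. The PyTorch code below uses hypothetical names (DaVLSketch, index_emb, num_types) and substitutes a simple question-guided attention for the paper's full graph-based reasoning; it shows only how a learnable per-type embedding lets the model distinguish, and differentially weight, different representation types.

```python
# Minimal sketch of the DaVL index-embedding idea. All names and the
# attention-based fusion are illustrative assumptions, not the paper's
# actual implementation.
import torch
import torch.nn as nn

class DaVLSketch(nn.Module):
    """Tags each representation type with a learnable index embedding,
    then attends over all representations with the question to form a
    question-related joint representation."""

    def __init__(self, dim: int, num_types: int = 4):
        super().__init__()
        # One learnable index embedding per representation type
        # (e.g., visual vs. linguistic, at different semantic levels).
        self.index_emb = nn.Embedding(num_types, dim)
        # dim must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, reps: list[torch.Tensor], question: torch.Tensor):
        # reps[i]: (batch, n_i, dim) representations of type i
        # question: (batch, dim) question embedding
        tagged = []
        for i, r in enumerate(reps):
            idx = torch.full(r.shape[:2], i, dtype=torch.long, device=r.device)
            tagged.append(r + self.index_emb(idx))  # mark the node's type
        nodes = torch.cat(tagged, dim=1)            # (batch, sum n_i, dim)
        # Question-guided attention weighs each type's contribution.
        joint, _ = self.attn(question.unsqueeze(1), nodes, nodes)
        return joint.squeeze(1)                     # (batch, dim)
```

Because the index embedding is added before fusion, attention scores can depend on a representation's type as well as its content, which is what allows the importance of each type to be adjusted per question.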