Abstract

Video Question Answering (VideoQA) is a challenging multimodal task that requires recognizing visual elements and reasoning about their relations in the spatial and temporal dimensions, conditioned on the given video and question. Most existing GNN-based methods model the visual elements of a video as graph structures and reason about the relations between them. Despite their remarkable results, these methods neglect that the question also carries graph-structured dependencies, which can be used to reason about relations between the video and the question. In this work, we propose a multimodal graph reasoning and fusion network that builds three graph neural networks, for the appearance, motion, and text sequences respectively, and hierarchically reasons over and fuses nodes from the different modalities. Our proposed method outperforms several state-of-the-art methods on three benchmark datasets.
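To make the described architecture concrete, below is a minimal sketch of one plausible instantiation: three small graph networks over appearance, motion, and question-word nodes, followed by a hierarchical cross-modal fusion step. The layer definitions, the attention-based fusion operator, and all names and dimensions are illustrative assumptions; the abstract does not specify the paper's actual design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphLayer(nn.Module):
    """One round of message passing: aggregate neighbor features through a
    row-normalized (soft) adjacency matrix, then transform. This simple GCN-style
    layer is an assumption; the paper's actual graph operator is not given here."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (batch, num_nodes, dim); adj: (batch, num_nodes, num_nodes)
        return F.relu(self.proj(torch.bmm(adj, x)))

class MultimodalGraphFusion(nn.Module):
    """Three per-modality graph networks plus hierarchical fusion, loosely
    mirroring the abstract's description."""
    def __init__(self, dim):
        super().__init__()
        # One graph network per modality, as the abstract describes.
        self.appearance_gnn = GraphLayer(dim)
        self.motion_gnn = GraphLayer(dim)
        self.text_gnn = GraphLayer(dim)
        # Cross-modal fusion via attention from question nodes to video nodes;
        # the abstract only says "hierarchically reasons and fuses", so this
        # choice of operator is a guess.
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, app, mot, txt, adj_app, adj_mot, adj_txt):
        app = self.appearance_gnn(app, adj_app)   # appearance object nodes
        mot = self.motion_gnn(mot, adj_mot)       # motion clip nodes
        txt = self.text_gnn(txt, adj_txt)         # question-word nodes
        video = torch.cat([app, mot], dim=1)      # first level: fuse the two video graphs
        fused, _ = self.fuse(txt, video, video)   # second level: question attends to video
        return fused.mean(dim=1)                  # pooled representation for answer prediction

Letting the question graph attend to the combined video graphs reflects the abstract's central point: the question's own graph structure participates in cross-modal reasoning rather than serving only as a flat query vector.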
