Abstract

Video Question Answering (VideoQA) is a challenging multimodal task that requires recognizing visual elements and reasoning about their relations in the spatial and temporal dimensions, conditioned on the given video and question. Most existing GNN-based methods model the visual elements of a video as graph structures and reason about the relations between them. Despite their remarkable results, these methods neglect that the question also carries graph-structured dependencies, which can be used to reason about relations between the video and the question. In this work, we propose a multimodal graph reasoning and fusion network that builds three graph neural networks, for the appearance, motion, and text sequences respectively, and hierarchically reasons over and fuses nodes from the different modalities. Our proposed method outperforms several state-of-the-art methods on three benchmark datasets.
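To make the described architecture concrete, below is a minimal sketch of one plausible instantiation: three small graph networks over appearance, motion, and question-word nodes, followed by a hierarchical cross-modal fusion step. The layer definitions, the attention-based fusion operator, and all names and dimensions are illustrative assumptions; the abstract does not specify the paper's actual design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphLayer(nn.Module):
    """One round of message passing: aggregate neighbor features through a
    row-normalized (soft) adjacency matrix, then transform. This simple GCN-style
    layer is an assumption; the paper's actual graph operator is not given here."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (batch, num_nodes, dim); adj: (batch, num_nodes, num_nodes)
        return F.relu(self.proj(torch.bmm(adj, x)))

class MultimodalGraphFusion(nn.Module):
    """Three per-modality graph networks plus hierarchical fusion, loosely
    mirroring the abstract's description."""
    def __init__(self, dim):
        super().__init__()
        # One graph network per modality, as the abstract describes.
        self.appearance_gnn = GraphLayer(dim)
        self.motion_gnn = GraphLayer(dim)
        self.text_gnn = GraphLayer(dim)
        # Cross-modal fusion via attention from question nodes to video nodes;
        # the abstract only says "hierarchically reasons and fuses", so this
        # choice of operator is a guess.
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, app, mot, txt, adj_app, adj_mot, adj_txt):
        app = self.appearance_gnn(app, adj_app)   # appearance object nodes
        mot = self.motion_gnn(mot, adj_mot)       # motion clip nodes
        txt = self.text_gnn(txt, adj_txt)         # question-word nodes
        video = torch.cat([app, mot], dim=1)      # first level: fuse the two video graphs
        fused, _ = self.fuse(txt, video, video)   # second level: question attends to video
        return fused.mean(dim=1)                  # pooled representation for answer prediction

Letting the question graph attend to the combined video graphs reflects the abstract's central point: the question's own graph structure participates in cross-modal reasoning rather than serving only as a flat query vector.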
