Abstract

Video question answering (VideoQA) requires a comprehensive understanding of the visual content of videos. Existing VideoQA models mainly focus on scenarios involving a single event with simple object interactions, leaving event-centric scenarios that involve multiple events with dynamically complex object interactions largely unexplored. These conventional models usually rely on features extracted from global visual signals, which makes it difficult to capture object-level and event-level semantics. Although one recent work uses a static spatio-temporal graph to explicitly model object interactions in videos, it ignores the dynamic influence of questions on graph construction and fails to exploit the implicit event-level semantic clues in questions. To overcome these limitations, we propose a Self-supervised Dynamic Graph Reasoning (SDGraphR) model for VideoQA. SDGraphR learns a question-guided spatio-temporal graph that dynamically encodes intra-frame spatial correlations and inter-frame correspondences between objects in a video. Furthermore, SDGraphR discovers event-level cues from questions and uses them for self-supervised learning with an auxiliary event recognition task, which in turn improves VideoQA performance without any extra annotations. Extensive experiments validate the substantial improvements of SDGraphR over existing baselines.
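
As a rough illustration of what a question-guided spatio-temporal graph could look like, the sketch below builds a row-normalized adjacency matrix over per-frame object features. It is not the SDGraphR implementation: the function name `question_guided_adjacency`, the sigmoid gating by the question vector, the scaled dot-product affinity, and the restriction of edges to the same or adjacent frames are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_adjacency(obj_feats, q_feat):
    """Hypothetical sketch of a question-conditioned graph.

    obj_feats: array of shape (T, N, D), N object features in each of T frames.
    q_feat:    array of shape (D,), a question embedding.
    Returns a (T*N, T*N) adjacency matrix whose rows sum to 1.
    """
    T, N, D = obj_feats.shape
    # Assumption: the question modulates object features via a sigmoid gate.
    gate = 1.0 / (1.0 + np.exp(-q_feat))
    nodes = (obj_feats * gate).reshape(T * N, D)
    # Pairwise affinities between all object nodes (scaled dot product).
    scores = nodes @ nodes.T / np.sqrt(D)
    # Frame index of every node, used to separate edge types.
    frame = np.repeat(np.arange(T), N)
    dt = np.abs(frame[:, None] - frame[None, :])
    # Keep intra-frame (dt == 0) and adjacent-frame (dt == 1) edges only.
    scores = np.where(dt <= 1, scores, -np.inf)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 2, 4))      # 3 frames, 2 objects, 4-dim features
adj = question_guided_adjacency(feats, rng.normal(size=4))
print(adj.shape)                         # (6, 6)
```

Because the gate depends on the question, the same video yields a different graph for a different question, which is the sense in which such a graph is dynamic rather than static.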
