Abstract

Conventional methods for video question answering encode the video and question as sequences and carefully design multi-modal interactions to fuse information across modalities. Although these methods achieve promising results, they focus mainly on the sequential structure of video and fail to explore its underlying hierarchical semantic structure. In this work, we argue that video content, while sequential in time, is organized hierarchically in semantic space (e.g., object-action-scene). Corresponding to this complex video structure, questions also involve multi-granularity queries over the video. To cope with queries of different granularities, we propose an Object-to-Scene relational Reasoning (O2SR) framework, which encodes videos at multiple granularities in the semantic space under the guidance of the question. By modeling local object relations and global scene dependencies while encapsulating the corresponding question semantics into visual elements, O2SR generalizes better across different types of questions. Experimental evaluations show that our method achieves state-of-the-art performance on four datasets. Code will be released at: https://github.com/zophe98/O2SR.
