Abstract

Audio–visual question answering (AVQA) is an emerging task that aims to answer questions about a video by integrating its visual content, its audio stream, and the associations between them. The major challenge lies in fusing heterogeneous multi-modal data effectively enough to comprehend complex scenes while capturing the question-related clues needed to infer correct answers. Current AVQA models primarily employ attention mechanisms to extract question-related clues separately from the visual and audio modalities before combining them. However, these approaches have two limitations: (1) they neglect the association and complementarity between the audio and visual modalities; (2) encoding the visual or audio stream holistically limits the capacity to capture cross-modal and cross-temporal dynamic events. In this paper, we introduce the Heterogeneous Interactive Graph Network, a novel solution designed to address these limitations. Specifically, we construct heterogeneous multi-modal graphs that integrate the visual, audio, and question modalities in a unified manner. This approach explores the associations and complementarity among the modalities and models local temporal interactions across the visual and audio streams, enabling the capture of cross-modal and cross-temporal dynamic events. Additionally, we present a cross-modal feature alignment module, which acts as a bridge across the semantic gap among heterogeneous multi-modal data. It promotes the convergence of the multi-modal data distributions into a shared feature space, facilitating more effective and efficient processing. Extensive experimental results demonstrate the superiority of our method over state-of-the-art models across various question types on the challenging MUSIC-AVQA and AVQA benchmarks.
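
The abstract does not spell out how the heterogeneous multi-modal graph is built. As a purely illustrative sketch, and not the authors' implementation, one way to picture such a graph is to treat each video segment as one visual node and one audio node, add a single question node, and connect nodes with typed edges inside a local temporal window. The function name `build_hetero_graph`, the `window` parameter, the node layout, and the edge-type names are all assumptions made for this example.

```python
from collections import defaultdict


def build_hetero_graph(num_segments, window=1):
    """Build typed edge lists for a toy audio-visual-question graph.

    Assumed layout (hypothetical, not taken from the paper):
    nodes are ("visual", t) and ("audio", t) for each segment t,
    plus a single ("question", 0) node.
    """
    edges = defaultdict(list)
    question = ("question", 0)
    for t in range(num_segments):
        # The question node is linked to every segment node of both modalities.
        for modality in ("visual", "audio"):
            edges["question-" + modality].append((question, (modality, t)))
        # Segments within `window` steps of t count as local temporal neighbours.
        lo = max(0, t - window)
        hi = min(num_segments - 1, t + window)
        for s in range(lo, hi + 1):
            # Cross-modal edge: visual segment t to audio segment s (s == t included).
            edges["visual-audio"].append((("visual", t), ("audio", s)))
            if s > t:
                # Intra-modal temporal edges, stored once per unordered pair.
                edges["visual-visual"].append((("visual", t), ("visual", s)))
                edges["audio-audio"].append((("audio", t), ("audio", s)))
    return dict(edges)


graph = build_hetero_graph(num_segments=4, window=1)
for edge_type, pairs in sorted(graph.items()):
    print(edge_type, len(pairs))
```

Keeping the edges grouped by type is what makes the graph heterogeneous in this sketch: a downstream message-passing layer could apply different transformations to cross-modal, cross-temporal, and question edges.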
