Heterogeneous Interactive Graph Network for Audio–Visual Question Answering

Yihan Zhao, Wei Xi, Xinhui Liu, Gairui Bai, Jizhong Zhao

doi:10.1016/j.knosys.2024.112165

Abstract

Audio–visual question answering (AVQA) is an emerging task that aims to answer questions by integrating visual content, audio streams, and their associations within a given video. The major challenge lies in effectively fusing heterogeneous multi-modal data to comprehend complex scenes while capturing question-related clues to infer correct answers. Current AVQA models primarily employ attention mechanisms to extract question-related clues separately from the visual and audio modalities before combining them. However, these approaches have two limitations: (1) they neglect the association and complementarity between the audio and visual modalities; (2) encoding the visual or audio stream holistically limits their capacity to capture cross-modal and cross-temporal dynamic events. In this paper, we introduce the Heterogeneous Interactive Graph Network, a novel solution designed to address these limitations. Specifically, we construct heterogeneous multi-modal graphs that integrate the visual, audio, and question modalities in a unified way. This approach explores the associations and complementarity among the modalities and models local temporal interactions across the visual and audio modalities, enabling the capture of cross-modal and cross-temporal dynamic events. Additionally, we present a cross-modal feature alignment module, which acts as a bridge to overcome the semantic gap among heterogeneous multi-modal data. It promotes the convergence of multi-modal data distributions into a shared feature space, facilitating more effective and efficient processing. Extensive experimental results demonstrate the superiority of our method over state-of-the-art models across various question types on the challenging MUSIC-AVQA and AVQA benchmarks.
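To make the idea of a heterogeneous multi-modal graph with local temporal interactions more concrete, the following is a minimal sketch only and not the construction used in the paper. It assumes that the video is split into a fixed number of segments, that each segment contributes one visual node and one audio node, that a single question node connects to every segment node, and that "local" means a temporal window of `window` segments. The function name `build_heterogeneous_edges`, the node layout, and the edge-type labels are all hypothetical.

```python
from itertools import product


def build_heterogeneous_edges(num_segments, window=1):
    """Return a list of typed edges (source, target, edge_type).

    Assumed node layout (illustrative, not taken from the paper):
      ("v", t)  visual node for segment t
      ("a", t)  audio node for segment t
      ("q", 0)  single question node
    """
    edges = []
    for t, s in product(range(num_segments), repeat=2):
        if abs(t - s) <= window:
            # Cross-modal edge between temporally nearby visual and audio segments.
            edges.append((("v", t), ("a", s), "visual-audio"))
        if 0 < s - t <= window:
            # Intra-modal temporal edges, stored once per unordered pair.
            edges.append((("v", t), ("v", s), "visual-visual"))
            edges.append((("a", t), ("a", s), "audio-audio"))
    for t in range(num_segments):
        # The question node is linked to every segment node of both modalities.
        edges.append((("q", 0), ("v", t), "question-visual"))
        edges.append((("q", 0), ("a", t), "question-audio"))
    return edges


if __name__ == "__main__":
    for source, target, edge_type in build_heterogeneous_edges(3, window=1):
        print(source, target, edge_type)
```

Keeping the edge type alongside each edge is what makes the graph heterogeneous in this sketch: a downstream message-passing layer could apply different learned transformations per edge type, although how the paper does this is not stated in the abstract.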

Full Text