Multi-Granularity Relational Attention Network for Audio-Visual Question Answering

Linjun Li, Jian Wang, Wang Lin, Zhou Zhao, Yan Xia, Shuwen Xiao, Weihao Jiang, Hao Jiang, Tao Jin, Wenwen Pan

doi:10.1109/tcsvt.2023.3264524

Abstract

Recent methods for video question answering (VideoQA), which aim to generate answers from a given question and video content, have made significant progress in cross-modal interaction. From the perspective of video understanding, existing frameworks concentrate on various levels of visual content, partially assisted by subtitles. However, audio information is also instrumental in obtaining correct answers, especially for videos of real-life scenarios. Indeed, in some cases audio and visual content are both required and complement each other in answering a question; this task is defined as audio-visual question answering (AVQA). In this paper, we focus on incorporating raw audio into AVQA and make three contributions. First, because no existing dataset annotates QA pairs for raw audio, we introduce E-AVQA, a large-scale, manually annotated dataset involving multiple modalities. E-AVQA consists of 34,033 QA pairs on 33,340 clips from 18,786 videos in e-commerce scenarios. Second, we propose a multi-granularity relational attention method, named MGN, that applies contrastive constraints between audio and visual features after their interaction. MGN captures local sequential representations with a pairwise potential attention mechanism and obtains a global multi-modal representation with a novel ternary potential attention mechanism. Third, MGN outperforms the baseline on E-AVQA, achieving 20.73% on WUPS@0.0 and 19.81% on BLEU@1, an improvement of at least 1.02 on WUPS@0.0 and of about 10% in time complexity over the baseline.
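The abstract reports BLEU@1 without defining it. As a rough illustration only, the sketch below assumes BLEU@1 means standard sentence-level unigram BLEU (clipped unigram precision multiplied by a brevity penalty) against a single reference answer, with lowercased whitespace tokenization. The function name `bleu1`, the tokenization, and the single-reference setup are assumptions made for this example; the paper's actual evaluation script is not described here and may differ.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-1 against one reference (illustrative sketch).

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    cand_counts = Counter(cand)
    ref_counts = Counter(ref)

    # Clip each candidate token's count by its count in the reference,
    # so repeating a correct word cannot inflate the score.
    overlap = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    precision = overlap / len(cand)

    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    return brevity_penalty * precision


if __name__ == "__main__":
    # Hypothetical answer pair, not taken from E-AVQA.
    print(bleu1("the bag is red", "the bag is dark red"))
```

A corpus-level score such as the 19.81% quoted above would normally aggregate over all QA pairs; how that aggregation is done in the paper is not stated in the abstract.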

Journal: IEEE Transactions on Circuits and Systems for Video Technology
Publication Date: Jan 1, 2024
Citations: 2
