Decompose the Sounds and Pixels, Recompose the Events

Varshanth R Rao, Juwei Lu, Md Ibrahim Khalil, Haoda Li, Peng Dai

doi:10.1609/aaai.v36i2.20111

Abstract

In this paper, we propose a framework centered on a novel architecture called the Event Decomposition Recomposition Network (EDRNet) to tackle the Audio-Visual Event (AVE) localization problem in the supervised and weakly supervised settings. AVEs in the real world exhibit common unraveling patterns, termed Event Progress Checkpoints (EPCs), which humans can perceive through the cooperation of their auditory and visual senses. Unlike earlier methods that attempt to recognize entire event sequences, the EDRNet models EPCs and inter-EPC relationships using stacked temporal convolutions. Based on the postulation that EPC representations are theoretically consistent for an event category, we introduce State Machine Based Video Fusion, a novel augmentation technique that blends source videos using different EPC template sequences. Additionally, we design a new loss function called the Land-Shore-Sea loss to compactify continuous foreground and background representations. Lastly, to alleviate the issue of confusing events during weak supervision, we propose a prediction stabilization method called Bag to Instance Label Correction. Experiments on the AVE dataset show that our collective framework outperforms the state-of-the-art by a sizable margin.
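As a rough illustration of the "stacked temporal convolutions" mentioned in the abstract, the sketch below applies a 1-D convolution along the time axis of a sequence of per-segment feature vectors and then stacks several such layers. It is not the EDRNet architecture: the function names, kernel weights, kernel sizes, and the ReLU non-linearity are all assumptions chosen for illustration, and the abstract does not specify any of them. The only point it shows is that each added layer widens the temporal context a given output position can see, which is the general property that lets later layers relate neighbouring local patterns.

```python
from typing import List

# A sequence of T segments, each described by a D-dimensional feature vector.
Sequence = List[List[float]]


def temporal_conv(seq: Sequence, kernel: List[float]) -> Sequence:
    """Depthwise 1-D convolution along time, zero padded so the length stays T.

    The same scalar kernel is applied to every feature dimension
    (an illustrative simplification, not taken from the paper).
    """
    pad = len(kernel) // 2
    num_steps, dim = len(seq), len(seq[0])
    out = []
    for t in range(num_steps):
        row = [0.0] * dim
        for j, weight in enumerate(kernel):
            src = t + j - pad  # time index read by this kernel tap
            if 0 <= src < num_steps:  # positions outside the clip count as zero
                for d in range(dim):
                    row[d] += weight * seq[src][d]
        out.append(row)
    return out


def relu(seq: Sequence) -> Sequence:
    """Element-wise ReLU (assumed non-linearity)."""
    return [[max(0.0, v) for v in row] for row in seq]


def stacked_temporal_convs(seq: Sequence, kernels: List[List[float]]) -> Sequence:
    """Apply one temporal convolution per kernel, each followed by ReLU."""
    for kernel in kernels:
        seq = relu(temporal_conv(seq, kernel))
    return seq


def receptive_field(kernels: List[List[float]]) -> int:
    """Number of input segments one output position depends on (stride 1, no dilation)."""
    return 1 + sum(len(k) - 1 for k in kernels)
```

For example, two stacked layers with kernels of size 3 give `receptive_field([[1, 1, 1], [1, 1, 1]]) == 5`, so each output segment depends on up to five consecutive input segments, while a single layer sees only three.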

Journal: Proceedings of the ... AAAI Conference on Artificial Intelligence
Publication Date: Jun 28, 2022
Citations: 3

Similar Papers

Dense Modality Interaction Network for Audio-Visual Event Localization
Shuo Liu ... Yuan Liu
IEEE Transactions on Multimedia | VOL. 25
01 Jan 2023

Audio-Visual Event Localization in Unconstrained Videos
Yapeng Tian ... Jing Shi
01 Jan 2018

Collaborative Audio-Visual Event Localization Based on Sequential Decision and Cross-Modal Consistency
Yuqian Kuang ... Xiaopeng Fan
04 Jun 2023

Multimodal Attentive Fusion Network for audio-visual event recognition
Mathilde Brousmiche ... Stéphane Dupont
Information Fusion | VOL. 85
08 Apr 2022
