Supp1-3137988.pdf

Di Hu

doi:10.1109/tpami.2021.3137988/mm1

Abstract

Audiovisual scenes are pervasive in our daily life. Humans readily localize and tell apart different sounding objects, but it remains challenging for machines to achieve class-aware sounding object localization, i.e., localizing each sounding object and recognizing its category, without category annotations. To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision. First, we determine the sounding area via coarse-grained audiovisual correspondence in single-source cases. The visual features in the sounding area are then used as candidate object representations to establish a category-representation object dictionary for expressive visual character extraction. By referring to this dictionary, we generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas. Finally, we employ category-level audiovisual consistency as supervision to achieve fine-grained alignment between the audio and sounding-object distributions. Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as in filtering out silent ones. We also transfer the learned audiovisual network to the typical visual task of object detection, obtaining reasonable performance.
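The localization steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the networks, similarity measure, or losses, so the cosine similarity, the fixed `threshold`, and all function and variable names below are assumptions made purely for illustration. The sketch scores each spatial location against an audio embedding to find the sounding area, scores it against each dictionary entry to get class-aware maps, and zeroes out locations that do not correspond to the audio.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_map(query, visual_map):
    """Score every location of an H x W grid of feature vectors against a query."""
    return [[cosine(query, feat) for feat in row] for row in visual_map]


def class_aware_sounding_maps(audio_vec, visual_map, dictionary, threshold=0.5):
    """Hypothetical helper: per-category maps with silent locations suppressed.

    Assumes audio and visual features live in a shared embedding space and that
    a hard threshold on audiovisual similarity marks the sounding area.
    """
    sounding = similarity_map(audio_vec, visual_map)  # where the sound comes from
    maps = {}
    for name, prototype in dictionary.items():
        obj = similarity_map(prototype, visual_map)   # where this category appears
        maps[name] = [
            [o if s > threshold else 0.0 for o, s in zip(obj_row, snd_row)]
            for obj_row, snd_row in zip(obj, sounding)
        ]
    return maps


# Toy scene: a 1 x 2 grid with one "guitar-like" and one "drum-like" location.
visual_map = [[[1.0, 0.0], [0.0, 1.0]]]
dictionary = {"guitar": [1.0, 0.0], "drum": [0.0, 1.0]}
audio_vec = [1.0, 0.0]  # the audio matches only the guitar-like location

maps = class_aware_sounding_maps(audio_vec, visual_map, dictionary)
print(maps)
```

In this toy scene the drum is visible but silent, so its map is suppressed everywhere, while the guitar map keeps its response at the sounding location.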

Similar Papers

Class-Aware Sounding Objects Localization via Audiovisual Correspondence
Di Hu ... Weiyao Lin
IEEE Transactions on Pattern Analysis and Machine Intelligence | VOL. 44
01 Dec 2022

Advanced AudioBIFS: Virtual Acoustics Modeling in MPEG-4 Scene Description
R Vaananen ... J Huopaniemi
IEEE Transactions on Multimedia | VOL. 6
01 Oct 2004

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching
11 Oct 2020

Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation
Shaofei Huang ... Jizhong Han
01 Aug 2023