Unsupervised Synthetic Acoustic Image Generation for Audio-Visual Scene Understanding.

Valentina Sanguineti, Alessio Del Bue, Vittorio Murino, Pietro Morerio

doi:10.1109/tip.2022.3219228

Abstract

Acoustic images are an emergent data modality for multimodal scene understanding. Such images have the peculiarity of distinguishing the spectral signature of the sound coming from different directions in space, thus providing richer information than that derived from single or binaural microphones. However, acoustic images are typically generated by cumbersome and costly microphone arrays, which are not as widespread as ordinary microphones. This paper shows that it is still possible to generate acoustic images from off-the-shelf cameras equipped with only a single microphone, and how they can be exploited for audio-visual scene understanding. We propose three architectures inspired by Variational Autoencoder, U-Net and adversarial models, and we assess their advantages and drawbacks. Such models are trained to generate spatialized audio by conditioning them on the associated video sequence and its corresponding monaural audio track. Our models are trained using the data collected by a microphone array as ground truth. Thus, they learn to mimic the output of an array of microphones in the very same conditions. We assess the quality of the generated acoustic images considering standard generation metrics and different downstream tasks (classification, cross-modal retrieval and sound localization). We also evaluate our proposed models by considering multimodal datasets containing acoustic images, as well as datasets containing just monaural audio signals and RGB video frames. In all of the addressed downstream tasks we obtain notable performance using the generated acoustic data, when compared to the state of the art and to the results obtained using real acoustic images as input.
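
As a rough illustration of the data modality and of the sound localization task mentioned in the abstract, the sketch below treats an acoustic image as a 3D array with two spatial axes (directions in space) and one spectral axis. The array shape, the function names (`energy_map`, `localize`, `mse`) and the toy data are assumptions made for illustration only; they are not taken from the paper and do not reproduce its models or metrics.

```python
import numpy as np

def energy_map(acoustic_image):
    """Collapse the spectral axis: (H, W, F) -> (H, W) energy per direction."""
    return np.sum(acoustic_image ** 2, axis=-1)

def localize(acoustic_image):
    """Return the (row, col) direction with the highest acoustic energy."""
    emap = energy_map(acoustic_image)
    return np.unravel_index(np.argmax(emap), emap.shape)

def mse(generated, real):
    """Mean squared error between a generated and a real acoustic image."""
    return float(np.mean((generated - real) ** 2))

# Toy data (hypothetical shape): low-level noise plus one source at (10, 20).
rng = np.random.default_rng(0)
real = 0.01 * rng.standard_normal((36, 48, 12))
real[10, 20, :] += 1.0

# Stand-in for a model output: the real image with a small perturbation.
generated = real + 0.01 * rng.standard_normal(real.shape)

print(localize(real), localize(generated), mse(generated, real))
```

In this toy setting, a generated image is useful for localization when its energy peak falls in the same direction as the peak of the image recorded by the microphone array.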

Journal: IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society
Publication Date: Jan 1, 2022
Citations: 6

Similar Papers

Audio-Visual Localization by Synthetic Acoustic Image Generation
Valentina Sanguineti ... Alessio Del Bue
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 35
18 May 2021

Synthetic Acoustic Image Generation for Audio-Visual Localization
02 Feb 2022

Sound Localization and Separation in 3D Space Using a Single Microphone with a Metamaterial Enclosure.
Xuecong Sun ... Jun Yang
Advanced science (Weinheim, Baden-Wurttemberg, Germany) | VOL. 7
27 Dec 2019

Real-Time Sound Source Localization in Robots Using Fly Ormia Ochracea Inspired MEMS Directional Microphone
Asif Ishfaque ... Byungki Kim
IEEE Sensors Letters | VOL. 7
01 Jan 2023
