Abstract

Audio scene analysis (ASA) is a challenging, multifaceted task in audio signal processing that uncovers information about the nature of an audio recording. Regardless of the analysis goal, every audio scene contains some number of audio sources, yet this aspect is rarely explored or given much attention in research. This work demonstrates the utility of audio source counting with a novel multimodal system for ASA. Both the speaker counting and the sound event counting techniques use deep neural networks (DNNs) to predict the number of sources. We present competitive results for audio source counting: for speaker counting, we achieve a prediction accuracy of 46.03%, rising to 89.57% when a margin of error of $\pm 1$ is allowed, which outperforms state-of-the-art systems on similar tasks. For sound event counting, we achieve prediction accuracies of 50.55% exactly and 86.59% within a margin of error of $\pm 1$, respectively, establishing a clear baseline. Our system also demonstrates real-time capability, with an overall processing time of $\sim 0.4614$ s per audio recording.
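The abstract frames source counting as a prediction task solved with a DNN, evaluated both by exact accuracy and by accuracy within a margin of error of $\pm 1$. The paper's actual architecture is not described in this excerpt, so the following is only a hedged sketch of the general count-as-classification idea: a small feed-forward network (plain NumPy, with hypothetical feature dimensions and untrained random weights standing in for a trained model) maps spectrogram-like features to a softmax over candidate counts, and the $\pm 1$ metric counts a prediction that is off by one as correct.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 64 spectral features, one hidden layer, counts 0..10.
N_FEATURES, N_HIDDEN, MAX_COUNT = 64, 32, 10

# Randomly initialised weights stand in for a trained counting model.
W1 = rng.standard_normal((N_FEATURES, N_HIDDEN)) * 0.1
b1 = np.zeros(N_HIDDEN)
W2 = rng.standard_normal((N_HIDDEN, MAX_COUNT + 1)) * 0.1
b2 = np.zeros(MAX_COUNT + 1)

def predict_count(features: np.ndarray) -> int:
    """Forward pass: features -> softmax over candidate counts -> argmax."""
    h = np.maximum(features @ W1 + b1, 0.0)  # ReLU hidden layer
    logits = h @ W2 + b2
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs))

def accuracy_within(preds, targets, tol=0):
    """Fraction of predictions within `tol` of the true count (tol=1 gives the +/-1 metric)."""
    preds, targets = np.asarray(preds), np.asarray(targets)
    return float(np.mean(np.abs(preds - targets) <= tol))

# Toy evaluation on random stand-in feature vectors.
feats = rng.standard_normal((100, N_FEATURES))
preds = [predict_count(f) for f in feats]
targets = rng.integers(0, MAX_COUNT + 1, size=100)
print(accuracy_within(preds, targets, tol=0), accuracy_within(preds, targets, tol=1))
```

By construction, accuracy within $\pm 1$ can never be lower than exact accuracy, which matches the relationship between the two figures reported in the abstract (46.03% vs. 89.57%, and 50.55% vs. 86.59%).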
