Abstract

Audio scene analysis (ASA) is a challenging, multifaceted task in audio signal processing that uncovers information about the nature of an audio recording. Regardless of the analysis goal, every audio scene contains some number of audio sources, yet this aspect is rarely explored or given much attention in research. This work demonstrates the utility of audio source counting with a novel multimodal system for ASA. Both the speaker counting and the sound event counting techniques use deep neural networks (DNNs) to predict the number of sources. We present competitive results for audio source counting: for speaker counting, we achieve a prediction accuracy of 46.03%, rising to 89.57% when a margin of error of $\pm 1$ is allowed, which outperforms state-of-the-art systems on similar tasks. For sound event counting, we achieve prediction accuracies of 50.55% exactly and 86.59% within a margin of error of $\pm 1$, respectively, establishing a clear baseline. Our system also demonstrates real-time capability, with an overall processing time of $\sim 0.4614$ s per audio recording.
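The abstract frames source counting as a prediction task solved with a DNN, evaluated both by exact accuracy and by accuracy within a margin of error of $\pm 1$. The paper's actual architecture is not described in this excerpt, so the following is only a hedged sketch of the general count-as-classification idea: a small feed-forward network (plain NumPy, with hypothetical feature dimensions and untrained random weights standing in for a trained model) maps spectrogram-like features to a softmax over candidate counts, and the $\pm 1$ metric counts a prediction that is off by one as correct.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 64 spectral features, one hidden layer, counts 0..10.
N_FEATURES, N_HIDDEN, MAX_COUNT = 64, 32, 10

# Randomly initialised weights stand in for a trained counting model.
W1 = rng.standard_normal((N_FEATURES, N_HIDDEN)) * 0.1
b1 = np.zeros(N_HIDDEN)
W2 = rng.standard_normal((N_HIDDEN, MAX_COUNT + 1)) * 0.1
b2 = np.zeros(MAX_COUNT + 1)

def predict_count(features: np.ndarray) -> int:
    """Forward pass: features -> softmax over candidate counts -> argmax."""
    h = np.maximum(features @ W1 + b1, 0.0)  # ReLU hidden layer
    logits = h @ W2 + b2
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs))

def accuracy_within(preds, targets, tol=0):
    """Fraction of predictions within `tol` of the true count (tol=1 gives the +/-1 metric)."""
    preds, targets = np.asarray(preds), np.asarray(targets)
    return float(np.mean(np.abs(preds - targets) <= tol))

# Toy evaluation on random stand-in feature vectors.
feats = rng.standard_normal((100, N_FEATURES))
preds = [predict_count(f) for f in feats]
targets = rng.integers(0, MAX_COUNT + 1, size=100)
print(accuracy_within(preds, targets, tol=0), accuracy_within(preds, targets, tol=1))
```

By construction, accuracy within $\pm 1$ can never be lower than exact accuracy, which matches the relationship between the two figures reported in the abstract (46.03% vs. 89.57%, and 50.55% vs. 86.59%).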
