Abstract

In this paper, we propose lightweight deep neural networks for Acoustic Scene Classification (ASC) and a visualization method for presenting a sound scene context. To this end, we first propose an inception-based, low-memory-footprint ASC model as the ASC baseline. The ASC baseline is then compared with benchmark and high-complexity network architectures. Next, we improve the ASC baseline by proposing a novel deep neural network architecture which leverages a residual-inception design and multiple kernels. Given this novel residual-inception (NRI) model, we apply multiple model-compression techniques to evaluate the trade-off between model complexity and classification accuracy. Finally, we evaluate whether sound events detected in a sound scene recording can help to improve ASC accuracy and to present the sound scene context more comprehensively. We conduct extensive experiments on various ASC datasets, including the sound scene datasets proposed for the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Tasks 1A and 1B, 2019 Tasks 1A and 1B, 2020 Task 1A, 2021 Task 1A, and 2022 Task 1. Our experimental results on these ASC challenges highlight two main achievements. First, from the analysis of the trade-off between model performance and model complexity, we propose two low-complexity ASC models: the medium-size model (MM), with 4.96 M trainable parameters, 19.3 MB of memory occupation, and 7.12 BFLOPs; and the small-size model (SM), with a very low complexity of 120 K trainable parameters, 120 KB of memory occupation, and 0.82 BFLOPs. These ASC systems are highly competitive with state-of-the-art systems and suitable for real-life applications on a wide range of edge devices. Second, from the analysis of the role of sound events in a sound scene, we propose an effective visualization method for comprehensively presenting a sound scene context.
By combining both the sound scene and sound event information, the visualization method not only indicates predicted sound scene contexts with high probabilities but also provides statistics of sound events occurring in these sound scene contexts.
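As a back-of-envelope check on the reported footprints, the sketch below estimates memory occupation from the parameter counts alone. The precisions are assumptions, not stated in the abstract: float32 weights for the MM and int8-quantized weights for the SM (the latter suggested by the 1:1 ratio of 120 K parameters to 120 KB).

```python
# Hypothetical consistency check of reported model footprints.
# Assumption: MM stored as float32 (4 bytes/param), SM as int8 (1 byte/param).
def model_memory_bytes(num_params: int, bytes_per_param: int) -> int:
    """Memory occupied by the trainable parameters alone (no activations/buffers)."""
    return num_params * bytes_per_param

mm = model_memory_bytes(4_960_000, 4)  # medium-size model (MM)
sm = model_memory_bytes(120_000, 1)    # small-size model (SM)

print(f"MM: {mm / 1e6:.2f} MB")  # → MM: 19.84 MB
print(f"SM: {sm / 1e3:.1f} KB")  # → SM: 120.0 KB
```

Under these assumptions the SM matches the reported 120 KB exactly at one byte per parameter, while the float32 estimate for the MM (19.84 MB) is close to, but slightly above, the reported 19.3 MB; the gap is plausibly explained by the paper's model-compression techniques.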
