Abstract

The number of publications on acoustic scene classification (ASC) in environmental audio recordings has steadily increased over the last few years. This growth was mainly stimulated by the annual Detection and Classification of Acoustic Scenes and Events (DCASE) competition, whose first edition took place in 2013. All competitions so far have involved one or multiple ASC tasks. With a focus on deep-learning-based ASC algorithms, this article summarizes and groups existing approaches for data preparation, i.e., feature representations, feature pre-processing, and data augmentation, and for data modeling, i.e., neural network architectures and learning paradigms. Finally, the paper discusses current algorithmic limitations and open challenges in order to preview possible future developments towards the real-life application of ASC systems.

Highlights

  • Recognizing different indoor and outdoor acoustic environments from recorded acoustic signals is an active research field that has received much attention in the last few years

  • Other related survey articles either focus on deep learning methods for acoustic event detection (AED) [4,5] or summarize the algorithms submitted to the various machine listening tasks, including acoustic scene classification (ASC), of a particular year of the DCASE challenge, such as [6]

  • ASC or AED systems perform the tasks of interest on derived signal representations, which will be introduced
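
As a rough illustration of what a derived signal representation can look like, the sketch below computes a log-magnitude spectrogram from a raw waveform with a short-time Fourier transform. It is not taken from the surveyed systems: the function name `log_spectrogram`, the frame length, the hop size, the Hann window, and the synthetic 440 Hz test tone are all assumptions chosen for the example.

```python
import numpy as np

def log_spectrogram(signal, frame_len=1024, hop=512, eps=1e-10):
    """Return a log-magnitude spectrogram of shape (n_frames, frame_len // 2 + 1).

    frame_len, hop, and eps are illustrative defaults, not values from the article.
    """
    window = np.hanning(frame_len)
    # Number of full frames that fit into the signal (no padding).
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Magnitude of the one-sided FFT of each frame.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Log compression; eps avoids log(0) for silent bins.
    return np.log(magnitude + eps)

# Hypothetical input: one second of a 440 Hz sine tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

features = log_spectrogram(tone)
peak_bin = int(np.argmax(features.mean(axis=0)))
print(features.shape, peak_bin)
```

The resulting two-dimensional time-frequency array is the kind of input that a classifier would then operate on in place of the raw samples.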


Summary

Introduction

Recognizing different indoor and outdoor acoustic environments from recorded acoustic signals is an active research field that has received much attention in the last few years. The task is an essential part of auditory scene analysis and involves summarizing an entire recorded acoustic signal using a pre-defined semantic description like “office room” or “public place”. These semantic entities are denoted as acoustic scenes, and the task of recognizing them as acoustic scene classification (ASC) [1]. A challenging task related to ASC is the detection of audio events that are temporarily present in an acoustic scene. Examples of such audio events include vehicles, car horns, and footsteps, among others. Evaluating and comparing the effectiveness of different methods is often complicated by the use of different evaluation datasets.

Evaluation and Deployment
Signal Representations
Fixed Signal Transformations
Learnable Signal Transformations
Pre-Processing
Data Augmentation Techniques
Network Architectures
Convolutional Neural Networks
Feedforward Neural Networks
Convolutional Recurrent Neural Networks
Learning Paradigms
Multiple Input Networks
Attention
Multitask Learning
Transfer Learning
Result Fusion
Domain Adaptation
Ambiguous Allocation between Sound Events and Scenes
Model Interpretability
Real-World Deployment
Conclusions and Future Directions