Abstract

The Sound Event Detection task aims to determine the temporal locations of acoustic events in audio clips. In recent years, the relevance of this field has risen with the introduction of datasets such as Google AudioSet or DESED (Domestic Environment Sound Event Detection) and competitive evaluations such as the DCASE Challenge (Detection and Classification of Acoustic Scenes and Events). In this paper, we analyze the performance of Sound Event Detection systems under diverse artificial acoustic conditions, such as high- or low-pass filtering and clipping or dynamic range compression, as well as under a scenario of high overlap between events. For this purpose, the audio was obtained from the Evaluation subset of the DESED dataset, whereas the systems were trained in the context of the DCASE Challenge 2020 Task 4. Our systems are based upon the challenge baseline, which consists of a Convolutional-Recurrent Neural Network trained using the Mean Teacher method, and they employ a multiresolution approach that improves Sound Event Detection performance through the use of several resolutions during the extraction of mel-spectrogram features. We provide insights into the benefits of this multiresolution approach in different acoustic settings, and compare the performance of the single-resolution systems in the aforementioned scenarios when using different resolutions. Furthermore, we complement the analysis of performance in the high-overlap scenario by assessing the degree of overlap of each event category in sound event detection datasets.

Highlights

  • During the 2020 edition of the DCASE Challenge, we introduced an approach that increased the performance of a Sound Event Detection (SED) system based on convolutional-recurrent neural networks (CRNN) by using several time-frequency resolutions in the process of mel-spectrogram feature extraction, and combining the outputs obtained with up to five different time-frequency resolution points

  • We offer an analysis of the performance of single-resolution and multiresolution SED systems when facing adverse acoustic scenarios that critically affect the spectra of the acoustic signals or their dynamic range, as well as situations in which the acoustic events are noticeably overlapped in time

  • The system performance is measured by means of the F1 score metric, which is computed as a combination of the True Positive (TP), False Positive (FP) and False Negative (FN) counts [32]: F1 = 2·TP / (2·TP + FP + FN)
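As a quick check of the F1 computation from the last highlight, here is a minimal sketch; the counts passed in are made up for illustration and are not results from the paper:

```python
def f1_score(tp, fp, fn):
    """F1 score combining True Positive, False Positive and
    False Negative counts: F1 = 2*TP / (2*TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    # Convention: F1 is 0 when there are no positives at all.
    return 2 * tp / denominator if denominator else 0.0

# Illustrative counts only.
print(f1_score(80, 20, 10))  # 160 / 190 ≈ 0.842
```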
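The multiresolution feature extraction described in the first highlight can be sketched as follows. The window/hop pairs below are illustrative resolution points, not the settings used in the DCASE 2020 systems, and plain STFT magnitudes stand in for mel-band energies so the example needs only numpy:

```python
import numpy as np

def stft_magnitude(signal, win_len, hop_len):
    """Magnitude spectrogram via a Hann-windowed STFT (numpy only)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop_len
    frames = np.stack([signal[i * hop_len : i * hop_len + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# Hypothetical resolution points: each (window, hop) pair trades
# time resolution against frequency resolution.
resolutions = [(256, 128), (1024, 512), (4096, 2048)]

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)  # 1 s of noise at 16 kHz

specs = {wl: stft_magnitude(audio, wl, hl) for wl, hl in resolutions}
for wl, spec in specs.items():
    # Longer windows give more frequency bins but fewer time frames.
    print(wl, spec.shape)
```

A multiresolution SED system would feed each of these feature sets to a model and combine the resulting outputs.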
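The adverse acoustic scenarios in the second highlight can be mimicked with simple signal perturbations. The functions below are toy numpy versions of clipping, dynamic range compression and low-pass filtering, for illustration only; they are not the exact transformations applied in the paper's evaluation:

```python
import numpy as np

def clip_signal(x, threshold):
    """Hard clipping: flattens peaks above the threshold."""
    return np.clip(x, -threshold, threshold)

def compress_dynamic_range(x, exponent=0.5):
    """Toy power-law compression: reduces the level difference between
    loud and quiet samples (not a calibrated audio compressor)."""
    return np.sign(x) * np.abs(x) ** exponent

def one_pole_lowpass(x, alpha=0.1):
    """First-order IIR low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1]).
    Smaller alpha attenuates high frequencies more strongly."""
    y = np.zeros_like(x, dtype=float)
    acc = 0.0
    for n, sample in enumerate(x):
        acc += alpha * (sample - acc)
        y[n] = acc
    return y
```

Applying such perturbations to evaluation audio makes it possible to compare how single-resolution and multiresolution systems degrade under each condition.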


Introduction

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Humans are able to identify events in our near environment using only acoustic information, namely, by hearing the sounds that those events produce. It is sufficient to hear a knock on a door to understand the underlying event and act accordingly. In this case, the knock on the door would be an example of a sound event
