Abstract

Sound Event Detection is a task of rising relevance in recent years in the field of audio signal processing, owing to the creation of specific datasets such as Google AudioSet or DESED (Domestic Environment Sound Event Detection) and the introduction of competitive evaluations like the DCASE Challenge (Detection and Classification of Acoustic Scenes and Events). The different categories of acoustic events can present diverse temporal and spectral characteristics; however, most approaches use a fixed time-frequency resolution to represent the audio segments. This work proposes a multi-resolution analysis for feature extraction in Sound Event Detection, hypothesizing that different resolutions can be more adequate for detecting different sound event categories, and that combining the information provided by multiple resolutions could improve the performance of Sound Event Detection systems. Experiments are carried out on the DESED dataset in the context of the DCASE 2020 Challenge, concluding that the combination of up to five resolutions allows a neural network-based system to outperform single-resolution models in terms of event-based F1-score in every event category, as well as in terms of PSDS (Polyphonic Sound Detection Score). Furthermore, we analyze the impact of score thresholding on the computation of F1-score results, finding that the standard value of 0.5 is suboptimal and proposing an alternative strategy based on the use of a specific threshold for each event category, which yields further performance improvements.
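
As an illustration of the class-wise thresholding strategy described above, the following minimal sketch binarizes frame-level detection scores with one threshold per DESED event category instead of a single global 0.5. The `binarize` helper and the threshold values are hypothetical placeholders for illustration, not the values selected in the paper.

```python
import numpy as np

# The ten DESED event categories.
CLASSES = ["Alarm_bell_ringing", "Blender", "Cat", "Dishes", "Dog",
           "Electric_shaver_toothbrush", "Frying", "Running_water",
           "Speech", "Vacuum_cleaner"]

def binarize(scores, thresholds):
    """Turn frame-level class scores into binary activity decisions,
    applying one decision threshold per event category."""
    thresholds = np.asarray(thresholds)           # shape: (n_classes,)
    return scores >= thresholds[np.newaxis, :]    # shape: (n_frames, n_classes)

# Dummy sigmoid outputs for one clip (one row per analysis frame).
scores = np.random.rand(625, len(CLASSES))

# Standard global threshold vs. class-specific thresholds
# (placeholder values, not those reported in the paper).
uniform = binarize(scores, [0.5] * len(CLASSES))
per_class = binarize(scores, [0.45, 0.55, 0.40, 0.30, 0.50,
                              0.35, 0.60, 0.35, 0.65, 0.50])
```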

Highlights

  • Understanding the acoustic environment is an ongoing challenge for artificial intelligence which has motivated several research fields

  • Since the BS resolution point coincides with the baseline system of DCASE 2020 Challenge Task 4, the results obtained at this resolution constitute the common benchmark for the task

  • The results of the single-resolution models are presented in Table 3 as the mean and standard deviation of the F1-scores obtained across the five training runs

Summary

INTRODUCTION

Understanding the acoustic environment is an ongoing challenge for artificial intelligence that has motivated several research fields. The use of two different resolutions has been proposed to improve automatic speech recognition in reverberant scenarios [23], in which a wide-context window provides information about the acoustic environment and reverberation, whereas a narrow-context window provides finer detail about the content of the speech signal. This is possible due to the tradeoff between time resolution and frequency resolution in the extraction of Fast Fourier Transform-based audio features [24], such as the mel-spectrogram, which is the basis for the analysis proposed in this work. The proposed analysis is tested using a state-of-the-art system, the baseline for DCASE 2020 Challenge Task 4, “Detection and Separation of Sound Events in Domestic Environments” [25]. The aim of this challenge is to make use of unlabeled and weakly-labeled recordings, together with strongly-labeled synthetic audio clips, to train systems that predict the temporal locations of ten different event categories in audio recordings.
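
As a minimal sketch of this time-frequency tradeoff, the snippet below computes log-mel spectrograms of the same clip at several resolution points using librosa: longer analysis windows sharpen frequency detail at the cost of time detail, and shorter windows do the opposite. The resolution names and the specific window/hop sizes are illustrative assumptions (only BS is meant to mimic a baseline-like configuration), not the exact settings used in the paper.

```python
import numpy as np
import librosa

# Illustrative resolution points: shorter windows favor time detail,
# longer windows favor frequency detail. All values are assumptions.
RESOLUTIONS = {
    "T++": dict(n_fft=512,  hop_length=64),
    "T+":  dict(n_fft=1024, hop_length=128),
    "BS":  dict(n_fft=2048, hop_length=256),   # baseline-like setting
    "F+":  dict(n_fft=4096, hop_length=512),
    "F++": dict(n_fft=8192, hop_length=1024),
}

def multi_resolution_mels(path, sr=16000, n_mels=64):
    """Compute one log-mel spectrogram of the clip per resolution point."""
    y, _ = librosa.load(path, sr=sr)
    feats = {}
    for name, params in RESOLUTIONS.items():
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, **params)
        feats[name] = librosa.power_to_db(mel, ref=np.max)
    return feats

# Each entry has shape (n_mels, n_frames); n_frames shrinks as the hop
# length grows, reflecting the time-frequency resolution tradeoff.
features = multi_resolution_mels("clip.wav")
```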

DESED DATASET
EXPERIMENTAL FRAMEWORK
MODEL FUSION
SCORE POST-PROCESSING
RESULTS AND DISCUSSION
SCORE THRESHOLDING RESULTS
EVENT OVERLAP ANALYSIS
CONCLUSION