Abstract

Sound event detection (SED) and localization refer to recognizing sound events and estimating their spatial and temporal locations. Using neural networks has become the prevailing method for SED. In the area of sound localization, which is usually performed by estimating the direction of arrival (DOA), learning-based methods have recently been developed. In this paper, it is experimentally shown that the trained SED model is able to contribute to the direction of arrival estimation (DOAE). However, joint training of SED and DOAE degrades the performance of both. Based on these results, a two-stage polyphonic sound event detection and localization method is proposed. The method learns SED first, after which the learned feature layers are transferred for DOAE. It then uses the SED ground truth as a mask to train DOAE. The proposed method is evaluated on the DCASE 2019 Task 3 dataset, which contains different overlapping sound events in different environments. Experimental results show that the proposed method is able to improve the performance of both SED and DOAE, and also performs significantly better than the baseline method.
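The key training detail in the second stage is that the SED ground truth acts as a mask on the DOA regression loss, so the network is only penalized for direction estimates on frames where an event is actually active. Below is a minimal numpy sketch of that masking idea; the function name, array shapes, and angle convention are illustrative assumptions, not the paper's actual implementation (which trains a CRNN on the DCASE 2019 Task 3 data).

```python
import numpy as np

def masked_doa_loss(doa_pred, doa_true, sed_true):
    """Mean squared DOA error computed only over frame/class entries
    where the SED ground truth marks an event as active.

    doa_pred, doa_true: (frames, classes, 2) azimuth/elevation in degrees
    sed_true:           (frames, classes) binary activity mask
    (shapes are an assumption for this sketch)
    """
    mask = sed_true[..., None]            # broadcast mask over the angle axis
    sq_err = (doa_pred - doa_true) ** 2
    # Normalize by the number of active angle entries (2 per active
    # frame/class), not by the total size, so inactive frames neither
    # add error nor dilute it; guard against an all-inactive batch.
    return (sq_err * mask).sum() / np.maximum(mask.sum() * 2, 1)

# Toy example: one active frame and one inactive frame, one class.
doa_true = np.array([[[30.0, 10.0]], [[0.0, 0.0]]])
doa_pred = np.array([[[20.0, 10.0]], [[50.0, 50.0]]])  # badly wrong on the inactive frame
sed_true = np.array([[1.0], [0.0]])

# The large error on the inactive frame is ignored by the mask:
# only the active frame's (10^2 + 0^2) / 2 = 50 contributes.
loss = masked_doa_loss(doa_pred, doa_true, sed_true)
```

Masking the loss this way decouples the two subtasks: the DOA branch never has to learn to output a "no event" value, which is one reason the two-stage scheme can outperform joint training.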

Highlights

  • Sound event detection is a rapidly developing research area that aims to analyze and recognize a variety of sounds in urban and natural environments

  • The results of direction of arrival (DOA) estimation with and without transfer (DOA-NT) show that, with trained convolutional neural network (CNN) layers transferred, the DOA error is consistently lower than without transfer, indicating that sound event detection (SED) information contributes to direction of arrival estimation (DOAE) performance; convergence is also much faster with the CNN layers transferred

  • Comparing SELDnet with DOA-NT shows that joint training outperforms training DOAE without transferred CNN layers, which further confirms that SED contributes to DOAE


Summary

Introduction

Sound event detection is a rapidly developing research area that aims to analyze and recognize a variety of sounds in urban and natural environments. Owing to their success in image recognition, convolutional neural networks (CNNs) have become the prevailing architecture in this area [7,8,9,10]. Such methods use suitable time-frequency representations of audio, which are analogous to the image inputs in computer vision. Another popular type of neural network is the recurrent neural network (RNN), whose ability to learn long temporal patterns in the data makes it suitable for SED [11]. Hybrids containing both CNN and RNN layers, known as convolutional recurrent neural networks (CRNNs), have also been proposed and have led to state-of-the-art performance in SED [4, 12].


