Abstract
Sound event localization and detection (SELD) combines sound event detection (SED) with direction-of-arrival (DoA) estimation from multichannel audio signals. Recent SELD research has predominantly focused on deep neural network (DNN) models that emphasize learning temporal context, such as the convolutional recurrent neural network (CRNN) and the ResNet-Conformer, which treat spectral and channel information only as embeddings of temporal features. To fully exploit spectral information, which provides a crucial cue for both SED and DoA estimation, a network architecture is needed that effectively learns both spectral and temporal contexts. To this end, we propose a divided transformer architecture that attends to the spectral and temporal contexts separately, encouraging the model to learn more of the spectral characteristics of signals while retaining the temporal context. The efficacy of the divided spectro-temporal transformer is validated on the DCASE 2022 and 2023 Challenge Task 3 datasets. Furthermore, a series of parameter studies carried out to optimize SELD performance shows that the number of frequency bins used for attention and the location of pooling affect performance, and that the divided spectro-temporal transformer benefits both SED and DoA estimation.
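The core idea of divided spectro-temporal attention can be illustrated with a minimal sketch. The snippet below is a hypothetical simplification (single-head attention with identity projections, NumPy only, no learned weights), not the paper's implementation: one attention pass operates across frequency bins within each frame, and a second pass operates across time frames within each frequency bin.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (seq_len, dim). Simplified single-head scaled dot-product
    # attention with identity Q/K/V projections for brevity.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def divided_spectro_temporal(x):
    # x: (time, freq, dim) feature map.
    t, f, d = x.shape
    # Spectral attention: for each frame, attend across frequency bins.
    x = np.stack([self_attention(x[i]) for i in range(t)])
    # Temporal attention: for each frequency bin, attend across frames.
    x = np.stack([self_attention(x[:, j]) for j in range(f)], axis=1)
    return x

y = divided_spectro_temporal(np.random.randn(10, 16, 8))
print(y.shape)  # (10, 16, 8)
```

Factoring the attention this way keeps the cost at O(T·F²) + O(F·T²) instead of O((T·F)²) for joint attention over the flattened spectro-temporal grid, while still letting each axis build its own context.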