Abstract
Sound event localization and detection (SELD) combines sound event detection (SED) with direction-of-arrival (DoA) estimation from multichannel audio signals. Recent SELD research has predominantly focused on deep neural network (DNN) models that emphasize learning temporal context, such as the convolutional recurrent neural network (CRNN) and the ResNet-Conformer, which treat spectral and channel information only as embeddings of temporal features. To fully exploit spectral information, which provides a crucial cue for both SED and DoA estimation, a network architecture is needed that effectively learns both spectral and temporal contexts. To this end, we propose a divided transformer architecture that attends to the spectral and temporal contexts separately, encouraging the model to learn more of the spectral characteristics of signals while retaining the temporal context. The efficacy of the divided spectro-temporal transformer is validated on the DCASE 2022 and 2023 Challenge Task 3 datasets. Furthermore, a series of parameter studies carried out to optimize SELD performance shows that the number of frequency bins used for attention and the location of pooling affect performance, and that the divided spectro-temporal transformer benefits both SED and DoA estimation.
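The core idea of divided spectro-temporal attention can be illustrated with a minimal sketch. The snippet below is a hypothetical simplification (single-head attention with identity projections, NumPy only, no learned weights), not the paper's implementation: one attention pass operates across frequency bins within each frame, and a second pass operates across time frames within each frequency bin.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (seq_len, dim). Simplified single-head scaled dot-product
    # attention with identity Q/K/V projections for brevity.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def divided_spectro_temporal(x):
    # x: (time, freq, dim) feature map.
    t, f, d = x.shape
    # Spectral attention: for each frame, attend across frequency bins.
    x = np.stack([self_attention(x[i]) for i in range(t)])
    # Temporal attention: for each frequency bin, attend across frames.
    x = np.stack([self_attention(x[:, j]) for j in range(f)], axis=1)
    return x

y = divided_spectro_temporal(np.random.randn(10, 16, 8))
print(y.shape)  # (10, 16, 8)
```

Factoring the attention this way keeps the cost at O(T·F²) + O(F·T²) instead of O((T·F)²) for joint attention over the flattened spectro-temporal grid, while still letting each axis build its own context.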