Sound Event Localization and Detection Using Convolutional Recurrent Neural Networks and Gated Linear Units

Tatsuya Komatsu, Masahito Togami, Tsubasa Takahashi

doi:10.23919/eusipco47968.2020.9287372

Abstract

This paper proposes a sound event localization and detection (SELD) method using a convolutional recurrent neural network (CRNN) with gated linear units (GLUs). The proposed method employs GLUs in the convolutional neural network (CNN) layers of the CRNN to extract suitable spectral features from amplitude and phase spectra. While the CNNs extract features that capture high-dimensional dependencies among frequency bins, the GLUs weight the extracted features according to the importance of each bin, much like an attention mechanism. Features extracted from bins where sound is absent, which are uninformative and degrade SELD performance, are weighted to 0 and ignored by the GLUs. Only the features extracted from informative bins contribute to the CNN output, which yields better SELD performance. The CNN outputs are then fed to consecutive bi-directional gated recurrent units (GRUs), which capture temporal information. Finally, the GRU output is shared by two task-specific branches, sound event detection (SED) layers and direction of arrival (DoA) estimation layers, to obtain the SELD results. Evaluation results on the TAU Spatial Sound Events 2019 - Ambisonic dataset show the effectiveness of GLUs in the proposed method, which improves SELD performance by up to 0.10 in F1-score, 0.15 in error rate, and 16.4° in DoA estimation error compared with a CRNN baseline method.
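
The gating described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: in the paper both branches are produced by CNN layers, whereas here the feature values and gate logits are supplied directly as toy arrays, and the names `glu`, `features`, and `gate_logits` are hypothetical. It only shows how a sigmoid gate scales per-bin features so that bins with a strongly negative gate logit are suppressed to nearly zero.

```python
import numpy as np


def sigmoid(x):
    """Logistic function mapping gate logits to weights in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def glu(features, gate_logits):
    """Gated linear unit: element-wise product of features and a sigmoid gate.

    Both inputs are assumed to have the same shape (frames x frequency bins).
    In a CRNN they would be the outputs of two parallel convolutions.
    """
    return features * sigmoid(gate_logits)


# Toy input: 1 time frame, 4 frequency bins (illustrative values only).
features = np.array([[0.8, -1.2, 0.5, 2.0]])

# Large positive logit: bin passes almost unchanged.
# Large negative logit: bin is suppressed to nearly zero.
# Zero logit: bin is scaled by exactly 0.5.
gate_logits = np.array([[10.0, -10.0, 0.0, 10.0]])

gated = glu(features, gate_logits)
print(gated)
```

In this toy case the second bin, standing in for a bin where no sound is present, contributes almost nothing to the output, while the first and last bins pass through nearly unchanged.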

Full Text