Abstract
This paper proposes a sound event localization and detection (SELD) method using a convolutional recurrent neural network (CRNN) with gated linear units (GLUs). The proposed method introduces to employ GLUs with convolutional neural network (CNN) layers of the CRNN to extract adequate spectral features from amplitude and phase spectra. When the CNNs extract features of high-dimensional dependencies of frequency bins, the GLUs weight the extracted features based on the importance of the bins, like attention mechanism. Extracted features from bins where sounds are absent, which is not informative and degrade the SELD performance, are weighted to 0 and ignored by GLUs. Only the features extracted from informative bins are used for the CNN output for better SELD performance. Obtained CNN outputs are fed to consecutive bi-directional gated recurrent units (GRUs), which capture temporal information. Finally, the GRU output are shared by two task-specific layers, which are sound event detection (SED) layers and direction of arrival (DoA) estimation layers, to obtain SELD results. Evaluation results using the TAU Spatial Sound Events 2019 - Ambisonic dataset show the effectiveness of GLUs in the proposed method, and it improves SELD performance up to 0.10 in F1-score, 0.15 in error rate, 16.4° in DoA estimation error comparing to a CRNN baseline method.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.