Decoupling Temporal Convolutional Networks Model in Sound Event Detection and Localization

Shen Song Shen Song,Xinyuan You Cong Zhang,Cong Zhang Shen Song

doi:10.53106/160792642023012401009

Abstract

Sound event detection (SED) is sensitive to network depth: as the network gets deeper, detection ability decreases. Event localization, however, requires a deeper network. In this paper, the accuracy of the joint task of event detection and localization is improved by decoupling SELD-TCN. The joint task is reflected in the early fusion of primary features and in the use of the SED branch as a mask for the direction-of-arrival (DOA) branch, which enhances generalization, while the advanced feature extraction and recognition of the two branches are carried out separately and in different ways. The primary features are extracted by resnet16-dilated instead of CNN-Pool. The SED branch adopts linear temporal convolution, detecting sound events by imitating a linear classifier, and ED-TCN is used for the localization branch. When the DOA branch and the SED branch are trained jointly, they affect each other badly. Using the most appropriate architecture for each branch and masking the DOA branch with the SED branch improves the performance of both. On the TUT Sound Events 2019 dataset, the DOA error is 6.73 with no overlapping sources, 8.8 with two overlapping sources, and 30.7 with three overlapping sources. The SED accuracy is significantly improved, and the DOA error is significantly reduced.
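
The idea of masking the DOA branch with the SED branch can be illustrated with a minimal sketch. The abstract does not specify how the mask is applied, so everything here is an assumption made for illustration: the function name `mask_doa_with_sed`, the per-frame, per-class output layout, the (azimuth, elevation) angle pairs, and the 0.5 activity threshold are all hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: one (azimuth, elevation) pair per class per frame.
Angle = Tuple[float, float]


def mask_doa_with_sed(
    sed_probs: List[List[float]],
    doa_preds: List[List[Angle]],
    threshold: float = 0.5,  # assumed activity threshold, not from the paper
) -> List[List[Optional[Angle]]]:
    """Keep a DOA estimate only where the SED branch marks the class as active."""
    masked = []
    for frame_probs, frame_doas in zip(sed_probs, doa_preds):
        if len(frame_probs) != len(frame_doas):
            raise ValueError("SED and DOA outputs must cover the same classes")
        # Inactive classes get None, so their DOA values are ignored downstream.
        masked.append([
            doa if prob >= threshold else None
            for prob, doa in zip(frame_probs, frame_doas)
        ])
    return masked


# Two frames, two classes: SED activity probabilities and raw DOA estimates.
sed = [[0.9, 0.1], [0.2, 0.7]]
doa = [[(30.0, 10.0), (-120.0, 0.0)], [(45.0, 5.0), (90.0, -20.0)]]
result = mask_doa_with_sed(sed, doa)
```

In this sketch, only the DOA estimates of classes that the SED output marks as active are kept, so localization values for inactive classes do not contribute to the final output. A training-time variant would apply the same mask to the DOA loss instead of to the predictions.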
