Dual-branch attention module-based network with parameter sharing for joint sound event detection and localization

Yuting Zhou,Hongjie Wan

doi:10.1186/s13636-023-00292-9

Abstract

The goal of sound event detection and localization (SELD) is to identify each individual sound event class and its activity time from a piece of audio, while estimating its spatial location at the time of activity. Conformer combines the advantages of convolutional layers and Transformer, which is effective in tasks such as speech recognition. However, it achieves high performance relying on complex network structure and a large number of computations. In the SELD task of this paper, we propose to use an encoder with a simpler network structure, called the dual-branch attention module (DBAM). The module is improved based on the conformer using two parallel branches of attention and convolution, which can model both global and local contextual information. We also blend low-level and high-level features of the localization task. In addition, we add soft parameter sharing to the joint SELD network, which can efficiently exploit the potential relationship between the two subtasks, SED and DOA. The proposed method can effectively detect two sound events with overlapping occurrence in the same time period. We experimented with the open dataset DCASE 2020 task 3 proving that the proposed method achieves better SELD performance than the baseline model. Furthermore, we conducted ablation experiments for verifying the effectiveness of the dual-branch attention module and soft parameter sharing.

Full Text