Abstract

Sound event localization and detection has been applied in various fields. Polyphony and noise interference make it challenging to accurately predict sound events and the locations at which they occur. To address this problem, we propose a Multiple Attention Fusion ResNet, which uses ResNet34 as the base network. Because sound durations are not fixed and recordings contain multiple overlapping sound events and noise, we introduce the Gated Channel Transform to enhance the residual basic block. This enables the model to capture contextual information, evaluate channel weights, and reduce the interference caused by polyphony and noise. Furthermore, Split Attention is introduced to capture cross-channel information, which improves the model's ability to distinguish overlapping sound events. Finally, Coordinate Attention is introduced so that the model can attend to both the channel information and the spatial location information of sound events. Experiments were conducted on two datasets, TAU-NIGENS Spatial Sound Events 2020 and TAU-NIGENS Spatial Sound Events 2021. The results demonstrate that the proposed model significantly outperforms state-of-the-art methods in multi-polyphony environments with directional noise interference, and achieves competitive performance in a single-polyphony environment.
