Abstract

This study focuses on the design of a loss function for a two-branch deep neural network (DNN) model that performs sound event localization and detection (SELD) on low-resource, realistic data. To this end, we employ a secondary audio-classification network that supplies global event information to the main network, enabling it to make robust SELD predictions. We further propose a momentum strategy for direction-of-arrival (DOA) estimation that exploits the strong temporal consistency of sound events and thereby effectively reduces localization error. Finally, we add a regularization term to the loss function to alleviate overfitting on the small dataset. We evaluate the proposed methods on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Task 3 dataset, and the results show consistent improvements in SELD performance. Compared with the baseline system, the proposed loss function yields significantly better localization and detection metrics on realistic data. Moreover, the proposed loss function generalizes across different network architectures, yielding consistent improvements on each of them.
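To make the three loss components more concrete, here is a minimal Python/NumPy sketch written under stated assumptions. It is not the authors' implementation. It assumes that the momentum strategy can be illustrated as an exponential moving average over per-frame DOA vectors, that the auxiliary classification branch contributes a weighted loss term, and that the regularizer is an L2 weight penalty. The function names `momentum_smooth`, `doa_mse`, and `combined_loss`, the `(T, 3)` array layout, and the weights `beta`, `lambda_cls`, and `lambda_reg` are all hypothetical.

```python
import numpy as np

def momentum_smooth(doa, beta=0.9):
    """Momentum-style (EMA) smoothing of frame-wise DOA vectors.

    Assumption: doa has shape (T, 3), one Cartesian DOA vector per frame.
    The output is not renormalized to unit length.
    """
    doa = np.asarray(doa, dtype=float)
    smoothed = np.empty_like(doa)
    smoothed[0] = doa[0]
    for t in range(1, len(doa)):
        # Blend the previous smoothed estimate with the current frame.
        smoothed[t] = beta * smoothed[t - 1] + (1.0 - beta) * doa[t]
    return smoothed

def doa_mse(pred, target):
    """Mean squared error between predicted and reference DOA vectors."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def combined_loss(seld_loss, cls_loss, params, lambda_cls=0.5, lambda_reg=1e-4):
    """Hypothetical total loss: main SELD term, weighted auxiliary
    classification term, and an L2 penalty over the given parameter arrays."""
    l2 = sum(float(np.sum(np.asarray(p) ** 2)) for p in params)
    return seld_loss + lambda_cls * cls_loss + lambda_reg * l2

# Usage example with toy values.
doa = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
smoothed = momentum_smooth(doa, beta=0.5)
total = combined_loss(1.0, 2.0, [np.array([1.0, 2.0])], lambda_cls=0.5, lambda_reg=0.1)
```

In this sketch, a larger `beta` places more weight on past frames. That matches the intuition that sound-event directions change slowly, but it also delays the response to genuine source movement, so any real choice of this weight would have to be tuned on the data.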
