Abstract
The lack of strongly labeled data can limit the potential of a Sound Event Detection (SED) system trained using deep learning approaches. To address this issue, this paper proposes a novel method to approximate strong labels for weakly labeled data using Nonnegative Matrix Factorization (NMF) in a supervised manner. Within a combined transfer learning and semi-supervised learning framework, two different Convolutional Neural Networks (CNNs) are trained on synthetic data, the approximated strongly labeled data, and unlabeled data, where one model produces the audio tags and the other produces the frame-level predictions. The proposed methodology is evaluated on three subsets of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 dataset: the validation dataset, the challenge evaluation dataset, and the public YouTube evaluation dataset. Our proposed methodology outperforms the baseline system by at least 7% across these three subsets. It also outperforms the top three submissions from DCASE 2019 challenge task 4 on the validation and public YouTube evaluation datasets, and its performance is competitive with the top submission of DCASE 2020 challenge task 4 on the challenge evaluation data. A post-challenge analysis on the validation dataset revealed the causes of the performance difference between our system and the top DCASE 2020 task 4 submission: 1) the detection threshold tuning method and 2) the augmentation techniques used. By changing our detection threshold tuning method, our system performs better than the first-place submission by 1.5%. The post-challenge analysis also revealed that our system is more robust than the top DCASE 2020 task 4 submission on long-duration audio clips, where we outperform it by 37%.
Highlights
An auditory scene is made up of several different sound events that overlap in time and frequency, resulting in a complex array of acoustic information reaching the human ear
This paper focuses on Sound Event Detection (SED), which reflects an aspect of the human auditory system
We propose a novel methodology to label the weakly labeled data, where only the event tags are known with certainty, using Nonnegative Matrix Factorization (NMF) [14] in a supervised manner
Summary
An auditory scene is made up of several different sound events that overlap in time and frequency, resulting in a complex array of acoustic information reaching the human ear. SED can be divided into two subtasks: the first identifies which sound event classes are present in an audio clip (audio tagging), whereas the second localizes each event in time [3]. Such a problem is more likely to be solved when one has access to a large corpus of strongly labeled data, where the event tags and their corresponding onsets and offsets are known with certainty. However, proposals submitted to the annual Detection and Classification of Acoustic Scenes and Events (DCASE) challenge task 4 indicate that weak labels can be an effective alternative for training a SED system [6]–[9]. We propose a novel methodology to label the weakly labeled data, where only the event tags are known with certainty, using NMF [14] in a supervised manner.
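To make the idea of supervised NMF label approximation concrete, the sketch below illustrates one way it could be realized: a nonnegative spectral basis is learned per event class from strongly labeled (e.g. synthetic) data, the spectrogram of a weakly labeled clip is then decomposed against the bases of its tagged classes only, and the resulting per-class activations are thresholded into frame-level activity. The function names, the number of components per class, and the threshold value are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.decomposition import NMF

# --- Illustrative settings (assumptions, not taken from the paper) ---
N_COMPONENTS_PER_CLASS = 8   # spectral templates learned per event class
ACTIVATION_THRESHOLD = 0.5   # relative threshold on normalized activations


def fit_class_basis(class_spectrograms, n_components=N_COMPONENTS_PER_CLASS):
    """Learn a nonnegative spectral basis for one event class.

    class_spectrograms: list of (n_freq, n_frames) magnitude spectrograms
    cut from strongly labeled data for that class.
    Returns a (n_freq, n_components) basis matrix.
    """
    V = np.concatenate(class_spectrograms, axis=1)   # stack frames of all examples
    model = NMF(n_components=n_components, init="nndsvda", max_iter=500)
    W = model.fit_transform(V)                        # V ~ W @ H
    return W


def activations_with_fixed_basis(V, W, n_iter=200, eps=1e-10):
    """Solve V ~ W @ H for H only, keeping the basis W fixed
    (Lee-Seung multiplicative updates for the Frobenius loss)."""
    H = np.random.rand(W.shape[1], V.shape[1]) + eps
    WtV = W.T @ V
    WtW = W.T @ W
    for _ in range(n_iter):
        H *= WtV / (WtW @ H + eps)
    return H


def approximate_strong_labels(V, weak_tags, class_bases,
                              threshold=ACTIVATION_THRESHOLD):
    """Approximate frame-level labels for one weakly labeled clip.

    V          : (n_freq, n_frames) magnitude spectrogram of the clip
    weak_tags  : list of event classes known to occur in the clip
    class_bases: dict mapping class name -> (n_freq, n_components) basis
    Returns a dict mapping class name -> boolean per-frame activity mask.
    """
    # Concatenate the bases of the tagged classes only (supervised decomposition).
    W = np.concatenate([class_bases[c] for c in weak_tags], axis=1)
    H = activations_with_fixed_basis(V, W)

    labels, offset = {}, 0
    for c in weak_tags:
        k = class_bases[c].shape[1]
        act = H[offset:offset + k].sum(axis=0)        # per-class activation energy
        offset += k
        act /= act.max() + 1e-10                      # normalize to [0, 1]
        labels[c] = act > threshold                   # frame-level activity mask
    return labels
```

Runs of consecutive active frames in each mask can then be converted to onset and offset times using the spectrogram hop length, yielding the approximated strongly labeled data referred to in the abstract.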