Abstract

The lack of strongly labeled data can limit the potential of a Sound Event Detection (SED) system trained using deep learning approaches. To address this issue, this paper proposes a novel method to approximate strong labels for weakly labeled data using Nonnegative Matrix Factorization (NMF) in a supervised manner. Using a combined transfer learning and semi-supervised learning framework, two different Convolutional Neural Networks (CNNs) are trained using synthetic data, approximated strongly labeled data, and unlabeled data: one model produces the audio tags, while the other produces the frame-level predictions. The proposed methodology is then evaluated on three different subsets of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 dataset: the validation dataset, the challenge evaluation dataset, and the public YouTube evaluation dataset. Our proposed methodology outperforms the baseline system by a minimum of 7% across these three data subsets. It also outperforms the top 3 submissions from the DCASE 2019 challenge task 4 on the validation and public YouTube evaluation datasets, and it is competitive against the top submission in DCASE 2020 challenge task 4 on the challenge evaluation data. A post-challenge analysis on the validation dataset revealed the causes of the performance difference between our system and the top submission of the DCASE 2020 challenge task 4. The leading causes we observed are 1) the detection threshold tuning method and 2) the augmentation techniques used. By changing our detection threshold tuning method, our system could outperform the first-place submission by 1.5%. The post-challenge analysis also revealed that our system is more robust than the top submission in DCASE 2020 challenge task 4 on long-duration audio clips, where we outperformed it by 37%.
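To illustrate why the detection threshold matters, the following minimal sketch turns the frame-level predictions of one event class into onset/offset pairs. It is not the authors' tuning method, which the abstract does not describe; the function name `frames_to_events`, the default threshold of 0.5, and the frame hop `hop_seconds` are all assumptions made for illustration.

```python
import numpy as np

def frames_to_events(probs, threshold=0.5, hop_seconds=0.064):
    """Convert per-frame probabilities of one class into (onset, offset) pairs in seconds.

    `threshold` and `hop_seconds` are illustrative assumptions, not values from the paper.
    """
    active = np.asarray(probs) >= threshold
    # Pad with False on both sides so every active run has a rising and a falling edge.
    padded = np.concatenate(([False], active, [False]))
    # Indices where the activity state changes: even positions are onsets, odd are offsets.
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    onsets, offsets = changes[0::2], changes[1::2]
    return [(float(on * hop_seconds), float(off * hop_seconds))
            for on, off in zip(onsets, offsets)]
```

Lowering the threshold merges or lengthens detected events, while raising it splits or drops them, so the same frame-level output can score quite differently depending on how the threshold is chosen.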

Highlights

  • An auditory scene is made up of several different sound events that overlap in time and frequency, resulting in a complex array of acoustic information reaching the human ear

  • This paper focuses on Sound Event Detection (SED), which reflects an aspect of the human auditory system

  • We propose a novel methodology to label the weakly labeled data, where only the event tags are known with certainty, using Nonnegative Matrix Factorization (NMF) [14] in a supervised manner


Summary

INTRODUCTION

An auditory scene is made up of several different sound events that overlap in time and frequency, resulting in a complex array of acoustic information reaching the human ear. The second subtask can be referred to as temporal localization [3]. Such a problem is more likely to be solved when one has access to a large corpus of strongly labeled data, where the event tags and the corresponding onsets and offsets are known with certainty. Proposals submitted to the annual Detection and Classification of Acoustic Scenes and Events (DCASE) challenge task 4 indicate that weak labels can be an effective alternative for training a SED system [6]–[9]. We propose a novel methodology to label the weakly labeled data, where only the event tags are known with certainty, using NMF [14] in a supervised manner.
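The general idea of using NMF in a supervised manner to approximate strong labels can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the paper: it assumes a nonnegative spectrogram `V` (frequency by frames) of a weakly labeled clip, a dictionary `W` of spectral templates learned beforehand for the tagged event class, the standard multiplicative update for the Euclidean cost, and a hypothetical relative threshold `rel_threshold` for deciding which frames are active.

```python
import numpy as np

def nmf_activations(V, W, n_iter=200, eps=1e-10, seed=0):
    """Estimate nonnegative activations H such that V ≈ W @ H, with W held fixed.

    Keeping the dictionary fixed is what makes the factorization "supervised" here.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    WtV = W.T @ V
    WtW = W.T @ W
    for _ in range(n_iter):
        # Multiplicative update for the Euclidean cost; preserves nonnegativity.
        H *= WtV / (WtW @ H + eps)
    return H

def approximate_strong_labels(V, W, rel_threshold=0.5):
    """Return a boolean mask of frames where the tagged event is assumed active.

    `rel_threshold` is a hypothetical parameter: a frame counts as active when its
    summed activation reaches that fraction of the clip's peak activation.
    """
    energy = nmf_activations(V, W).sum(axis=0)
    peak = energy.max()
    if peak <= 0:
        return np.zeros(energy.shape, dtype=bool)
    return energy >= rel_threshold * peak
```

The resulting frame mask can then be converted to onset and offset times using the frame hop of the spectrogram, which yields an approximate strong label for a clip whose weak label only names the event.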

RELATED WORK
PROPOSED METHODOLOGY
APPROXIMATING STRONG LABELS USING NMF
SEMI-SUPERVISED LEARNING
EXPERIMENT SETUP
Methodology
Findings
CONCLUSION