Polyphonic sound event detection (SED) is an emerging area with many applications in smart disaster safety, security, life logging, etc. This paper proposes a two-stage polyphonic SED model for the case in which strongly labeled data are limited but weakly labeled and unlabeled data are available. The first stage of the proposed SED model is built on a residual convolutional recurrent neural network (RCRNN)-based mean teacher model with convolutional block attention module (CBAM)-based attention. The second stage then fine-tunes the student model from the first stage by applying the proposed semi-supervised loss function, which accommodates the noisy targets of weakly labeled and unlabeled data. The proposed SED model is applied to both the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Challenge Task 4 and the DCASE 2020 Challenge Task 4, and its performance is compared with those of the baseline and top-ranked models from both challenges in terms of the F1-score and polyphonic sound detection score (PSDS). The experiments show that the RCRNN-based first-stage model with CBAM-based attention achieves a higher F1-score and PSDS than the baseline and top-ranked models for both challenges. Furthermore, the proposed two-stage SED model with the semi-supervised loss function improves the F1-score by 6.1% and 4.6% compared to the top-ranked models from DCASE 2019 and 2020, respectively.
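The abstract's first stage relies on mean-teacher semi-supervised training: a student network is trained with a supervised loss on labeled clips plus a consistency loss toward an exponential-moving-average (EMA) "teacher" copy of itself. The sketch below illustrates only this generic mean-teacher loop, not the authors' implementation; the `SimpleCRNN` placeholder, the MSE consistency loss, the consistency weight of 2.0, and the EMA decay of 0.999 are all illustrative assumptions standing in for the paper's RCRNN with CBAM-based attention and its tuned hyperparameters.

```python
# Minimal sketch (not the authors' code) of mean-teacher consistency training.
# SimpleCRNN is a placeholder for the paper's RCRNN with CBAM attention.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleCRNN(nn.Module):
    """Placeholder encoder: conv front-end + BiGRU + frame-wise sigmoid output."""
    def __init__(self, n_mels=64, n_classes=10, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        self.gru = nn.GRU(16 * (n_mels // 4), hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, 1, frames, mels)
        h = self.conv(x)                       # (batch, 16, frames, mels / 4)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h, _ = self.gru(h)
        return torch.sigmoid(self.head(h))     # frame-wise class probabilities


def ema_update(teacher, student, alpha=0.999):
    """Exponential moving average of the student weights into the teacher."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(alpha).add_(ps, alpha=1 - alpha)


student = SimpleCRNN()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# Toy batch: strongly labeled clips (frame-level targets) plus unlabeled clips.
x_lab = torch.randn(4, 1, 128, 64)
y_lab = torch.randint(0, 2, (4, 128, 10)).float()
x_unlab = torch.randn(4, 1, 128, 64)

for step in range(2):                          # a couple of illustrative steps
    sup = F.binary_cross_entropy(student(x_lab), y_lab)
    with torch.no_grad():                      # teacher provides soft targets
        t_out = teacher(x_unlab)
    cons = F.mse_loss(student(x_unlab), t_out)  # consistency loss
    loss = sup + 2.0 * cons                     # consistency weight is illustrative
    opt.zero_grad(); loss.backward(); opt.step()
    ema_update(teacher, student)
```

In this scheme only the student is updated by gradient descent; the teacher tracks it through the EMA, which is what makes its predictions usable as soft targets for the weakly labeled and unlabeled clips mentioned in the abstract.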