Abstract

In this paper, we propose a sound event detection (SED) method that uses a deep neural network trained on weakly labeled and unlabeled data. The proposed method utilizes a convolutional recurrent neural network (CRNN) to extract high-level features from audio clips. Inspired by the impressive performance of transfer learning in the field of image recognition, the convolutional neural network (CNN) in the proposed CRNN is a model pretrained on images. Although audio and images differ significantly, the image-pretrained CNN still achieves competitive performance in SED and can effectively reduce the amount of training data needed. To learn from weakly labeled data, the proposed method utilizes a weighted pooling strategy that enables the network to focus on the frames of an audio clip that contain events. For unlabeled data, the proposed method utilizes the mean teacher semi-supervised learning method together with data augmentation. To demonstrate the performance of the proposed method, we conduct an experimental evaluation on the DCASE2021 Task4 dataset. The results show that the proposed method outperforms the DCASE2021 Task4 baseline method.
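
To make the weighted pooling idea concrete, the sketch below shows one common attention-based realization in PyTorch: frame-level event probabilities from the CRNN are combined into a clip-level prediction using learned per-frame weights, so frames that likely contain events dominate the clip-level output. The layer names, tensor shapes, and the choice of a softmax over time are illustrative assumptions, not necessarily the exact design used in the paper.

```python
import torch
import torch.nn as nn

class WeightedPooling(nn.Module):
    """Attention-style weighted pooling over time (illustrative sketch).

    Aggregates frame-level predictions from a CRNN into a clip-level
    prediction, letting the network emphasize frames that contain events.
    """

    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_classes)  # frame-level event scores
        self.attention = nn.Linear(feature_dim, num_classes)   # frame-level attention scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature_dim) features from the CRNN
        frame_probs = torch.sigmoid(self.classifier(x))    # (B, T, C) per-frame probabilities
        weights = torch.softmax(self.attention(x), dim=1)  # normalize weights over time
        clip_probs = (weights * frame_probs).sum(dim=1)    # (B, C) clip-level probabilities
        return clip_probs
```

The clip-level output can be trained directly against the weak (clip-level) labels, for example with binary cross-entropy, while the frame-level probabilities provide the temporal localization needed for SED.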
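
For the unlabeled data, the mean teacher method maintains a second "teacher" copy of the model whose weights are an exponential moving average (EMA) of the student's weights, and adds a consistency loss that pushes student predictions toward teacher predictions on augmented unlabeled clips. A minimal sketch of the EMA update, assuming PyTorch and a decay hyperparameter alpha (the value 0.999 is a typical default, not a value reported in the paper):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def update_teacher(student: nn.Module, teacher: nn.Module, alpha: float = 0.999) -> None:
    """Exponential moving average update of the teacher's weights.

    Called after each optimizer step on the student; alpha close to 1
    makes the teacher a slowly moving average of recent student weights.
    """
    for s_param, t_param in zip(student.parameters(), teacher.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)
```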
