Abstract

In this paper, we propose a sound event detection (SED) method that uses a deep neural network trained on weakly labeled and unlabeled data. The proposed method utilizes a convolutional recurrent neural network (CRNN) to extract high-level features from audio clips. Inspired by the impressive performance of transfer learning in the field of image recognition, the convolutional neural network (CNN) in the proposed CRNN is a model pretrained on images. Although audio and images differ significantly, the image-pretrained CNN still achieves competitive performance in SED and can effectively reduce the amount of training data needed. To learn from weakly labeled data, the proposed method utilizes a weighted pooling strategy that enables the network to focus on the frames of an audio clip that contain events. For unlabeled data, the proposed method utilizes the mean teacher semi-supervised learning method together with data augmentation. To demonstrate the performance of the proposed method, we conduct an experimental evaluation on the DCASE2021 Task4 dataset. The results show that the proposed method outperforms the DCASE2021 Task4 baseline method.
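
To make the weighted pooling idea concrete, the sketch below shows one common attention-based realization in PyTorch: frame-level event probabilities from the CRNN are combined into a clip-level prediction using learned per-frame weights, so frames that likely contain events dominate the clip-level output. The layer names, tensor shapes, and the choice of a softmax over time are illustrative assumptions, not necessarily the exact design used in the paper.

```python
import torch
import torch.nn as nn

class WeightedPooling(nn.Module):
    """Attention-style weighted pooling over time (illustrative sketch).

    Aggregates frame-level predictions from a CRNN into a clip-level
    prediction, letting the network emphasize frames that contain events.
    """

    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_classes)  # frame-level event scores
        self.attention = nn.Linear(feature_dim, num_classes)   # frame-level attention scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature_dim) features from the CRNN
        frame_probs = torch.sigmoid(self.classifier(x))    # (B, T, C) per-frame probabilities
        weights = torch.softmax(self.attention(x), dim=1)  # normalize weights over time
        clip_probs = (weights * frame_probs).sum(dim=1)    # (B, C) clip-level probabilities
        return clip_probs
```

The clip-level output can be trained directly against the weak (clip-level) labels, for example with binary cross-entropy, while the frame-level probabilities provide the temporal localization needed for SED.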
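
For the unlabeled data, the mean teacher method maintains a second "teacher" copy of the model whose weights are an exponential moving average (EMA) of the student's weights, and adds a consistency loss that pushes student predictions toward teacher predictions on augmented unlabeled clips. A minimal sketch of the EMA update, assuming PyTorch and a decay hyperparameter alpha (the value 0.999 is a typical default, not a value reported in the paper):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def update_teacher(student: nn.Module, teacher: nn.Module, alpha: float = 0.999) -> None:
    """Exponential moving average update of the teacher's weights.

    Called after each optimizer step on the student; alpha close to 1
    makes the teacher a slowly moving average of recent student weights.
    """
    for s_param, t_param in zip(student.parameters(), teacher.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)
```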
