Abstract

One of the most common approaches to sound event detection is the convolutional neural network (CNN) or the convolutional recurrent neural network (CRNN) and their variants. However, the pooling operation of the CNN has the disadvantage of losing the location information of the target object. We drop the pooling operation, retain the ReLU and convolution operations, and use the strong dictionary constraints and penalty-function prior constraints of multi-layer convolutional sparse coding (ML-CSC). We propose an iterative deep neural network, the unfolded multi-layer local block coordinate descent network (ML-LoBCoD-NET), driven by the multi-layer local block coordinate descent (ML-LoBCoD) algorithm, which extends the local block coordinate descent (LoBCoD) algorithm. The ML-LoBCoD-NET can extract features different from those of the CNN. More importantly, for the weakly-supervised sound event detection task, we propose the MRNN-Att network, which combines the ML-LoBCoD-NET, a recurrent neural network (RNN), and an attention network. The MCRNN-Att network combines the MRNN-Att and CRNN networks to fuse their different features. Furthermore, for the semi-supervised sound event detection task, we propose the MRNN-Att mean teacher model (MRNN-Att-MT) and the MCRNN-Att mean teacher model (MCRNN-Att-MT), in which the MRNN-Att and MCRNN-Att networks, respectively, serve as the student model. These models were tested on the dataset of Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Task 4. The F1 score of the MRNN-Att-MT was 22.83% on the development set, 8.77 percentage points higher than the baseline system, and 15.68% on the evaluation set, 4.88 percentage points higher than the baseline system. The MCRNN-Att-MT had an F1 score of 20.35% on the development set, 6.29 percentage points higher than the baseline system, and 14.56% on the evaluation set, 3.76 percentage points higher than the baseline system.
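As a quick consistency check on the figures above, the reported gains can be read as absolute differences in F1 and subtracted from each score to recover the implied baseline. The sketch below assumes that reading; the dictionary layout and the function name `implied_baselines` are illustrative and do not come from the paper.

```python
# Reported F1 scores (%) and gains over the baseline, copied from the abstract.
reported = {
    ("MRNN-Att-MT", "development"): (22.83, 8.77),
    ("MRNN-Att-MT", "evaluation"): (15.68, 4.88),
    ("MCRNN-Att-MT", "development"): (20.35, 6.29),
    ("MCRNN-Att-MT", "evaluation"): (14.56, 3.76),
}


def implied_baselines(scores):
    """Subtract each gain from its F1 score, assuming gains are absolute points."""
    return {key: round(f1 - gain, 2) for key, (f1, gain) in scores.items()}


baselines = implied_baselines(reported)
for (model, split), value in baselines.items():
    print(f"{model:13s} {split:12s} implied baseline F1 = {value:.2f}%")
```

Under this reading both models imply the same baseline on each split (14.06% on the development set and 10.80% on the evaluation set), which suggests the gains are absolute rather than relative improvements.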

Highlights

  • People rely on sounds in the environment to obtain important information

  • The MRNN-Att network is based on the multi-layer local block coordinate descent network (ML-LoBCoD-NET), which is driven by the ML-LoBCoD algorithm

  • The F1 score of the MCRNN-Att model was 0.45% higher than that of the GCRNN-Att model. These results indicate that the MRNN-Att and MCRNN-Att models were better than the GCRNN-Att model, and that the features extracted by the ML-LoBCoD-NET were effective

Summary

INTRODUCTION

People rely on sounds in the environment to obtain important information. Sound event detection (SED) detects specific audio events in audio recordings, estimates the onset and offset times of sound events, and provides a label for each event.

J. Wang et al.: Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models

[...] trained, and the neural networks can output the results [5]. Inspired by the use of the mean teacher model for semi-supervised learning, this paper proposes two mean teacher models for sound event detection in the domestic environment. The first proposed mean teacher model is the MRNN-Att-MT, whose student model is the MRNN-Att. The second is the MCRNN-Att-MT, whose student model is the MCRNN-Att. The weakly-labeled sound event detection task is the core problem of this paper, and the proposed MRNN-Att network is its core method.
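For readers unfamiliar with the mean teacher idea, the following minimal sketch shows its two ingredients in the general formulation: the teacher's weights are an exponential moving average (EMA) of the student's weights, and a consistency term penalizes disagreement between student and teacher predictions. The smoothing factor `alpha`, the mean-squared-error consistency term, the parameter name `"w"`, and the function names are assumptions made for illustration; the summary does not state the settings used for the MRNN-Att-MT or MCRNN-Att-MT.

```python
def ema_update(teacher, student, alpha=0.999):
    """Move each teacher parameter toward the matching student parameter.

    `teacher` and `student` map hypothetical parameter names to lists of floats.
    """
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


def consistency_loss(student_probs, teacher_probs):
    """Mean squared difference between student and teacher predictions."""
    diffs = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(diffs) / len(diffs)


# Toy example: one parameter vector, one EMA step with alpha = 0.9.
student = {"w": [1.0, 2.0]}
teacher = {"w": [0.0, 0.0]}
teacher = ema_update(teacher, student, alpha=0.9)

# Consistency between two illustrative prediction vectors.
loss = consistency_loss([0.2, 0.8], [0.4, 0.6])
```

In a training loop of this kind, only the student is updated by gradient descent; the teacher is refreshed by `ema_update` after each step, and the consistency term can be computed on unlabeled clips because it needs no ground-truth labels.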

BACKGROUND
THE PROPOSED ML-LoBCoD ALGORITHM
THE PROPOSED MRNN-ATT-MT MODEL FOR SOUND
EXPERIMENTAL RESULTS AND ANALYSIS
Findings
CONCLUSION