Abstract

We propose using derivative features for sound event detection (SED) based on deep neural networks. As input to the networks, we used the log-mel filterbank and its first and second derivative features for each frame of the audio signal. Two deep neural networks were used to evaluate the effectiveness of these derivative features. Specifically, a convolutional recurrent neural network (CRNN) was constructed by combining a convolutional neural network (CNN) and a recurrent neural network (RNN), followed by a feed-forward neural network (FNN) acting as a classification layer. In addition, a mean-teacher model based on an attention CRNN was used. Both models had an average pooling layer at the output so that weakly labeled and unlabeled audio data could be used during model training. Across the various training conditions, depending on the neural network architecture and training set, the derivative features yielded a consistent performance improvement. Experiments on audio data from the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 and 2019 challenges showed a maximum relative improvement of 16.9% in terms of the F-score.
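
As a concrete illustration of the input pipeline described above, the following sketch computes the three-channel input (static log-mel filterbank plus its first and second derivatives) with librosa. The sample rate, FFT size, hop length, and number of mel bands are illustrative assumptions, not the paper's exact configuration.

import librosa
import numpy as np

def derivative_features(path, sr=16000, n_mels=64):
    # Load audio and compute the static log-mel filterbank
    # (assumed analysis parameters; the paper's settings may differ).
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=512, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)          # (n_mels, frames)
    # First and second temporal derivatives (delta and delta-delta).
    d1 = librosa.feature.delta(logmel, order=1)
    d2 = librosa.feature.delta(logmel, order=2)
    # Stack as three input channels, analogous to the RGB channels of
    # an image, so a 2-D CNN can consume them directly.
    return np.stack([logmel, d1, d2], axis=0)  # (3, n_mels, frames)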

Highlights

  • Humans can obtain information about their surroundings from nearby sounds

  • The classification results on the DCASE 2018 test set are shown in Table 3, where "Single channel" indicates that only the static log-mel filterbank was used as input to the basic convolutional recurrent neural network (CRNN), and "Three channels" indicates that the derivative features were also used as input

  • We proposed the use of the first and second delta features of the log-mel filterbank to improve the performance of state-of-the-art CRNNs



Introduction

Humans can obtain information about their surroundings from nearby sounds. Accordingly, sound signal analysis, whereby information may be automatically extracted from audio data, has attracted considerable attention. The recently proposed convolutional recurrent neural networks (CRNNs), which combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have exhibited satisfactory classification performance in sound event detection (SED) [11]. They are currently recognized as a highly effective deep neural network architecture for SED and have been widely used in the DCASE challenge since 2018. The attention method has proven effective in identifying sound events from audio recordings, including noisy sounds. Owing to their availability, unlabeled data are critical for improving SED. In [20], for efficient use of unlabeled training data, a mean-teacher model based on an attention-based CRNN was proposed for SED, and it achieved the best performance in the DCASE 2018 challenge. Deep neural networks for SED have thus evolved from simple feed-forward neural networks (FNNs) to recent CRNNs, in which an attention-based architecture as well as mean-teacher model-based training and evaluation are used.
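
A minimal sketch of such a model and of the mean-teacher update is given below in PyTorch. The layer sizes, class count, and EMA decay are illustrative assumptions, and the attention mechanism of the actual model is replaced here by simple average pooling over time.

import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN -> bidirectional GRU -> FNN classifier, with average pooling
    over time so the model can be trained from clip-level (weak) labels."""
    def __init__(self, n_mels=64, n_classes=10, rnn_hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((4, 1)),  # pool frequency only; keep time resolution
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
        )
        self.rnn = nn.GRU(64 * (n_mels // 16), rnn_hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * rnn_hidden, n_classes)

    def forward(self, x):              # x: (batch, 3, n_mels, frames)
        h = self.cnn(x)                # (batch, 64, n_mels // 16, frames)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)
        h, _ = self.rnn(h)
        frame_probs = torch.sigmoid(self.fc(h))  # frame-level (strong) output
        clip_probs = frame_probs.mean(dim=1)     # average pooling -> weak output
        return frame_probs, clip_probs

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    # Mean-teacher update: the teacher's weights are an exponential
    # moving average of the student's weights.
    for tp, sp in zip(teacher.parameters(), student.parameters()):
        tp.mul_(alpha).add_(sp, alpha=1 - alpha)

In such a setup, clip_probs would be scored against weak labels, frame_probs against strong labels where available, and a consistency loss between student and teacher outputs would exploit the unlabeled data, with ema_update refreshing the teacher after each optimizer step.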

Preprocessing
Derivative
Network Architecture
Basic CRNN
Mean-Teacher Model
Structure
Database
Evaluation Metrics
Experimental Results
Learning
Conclusions