Abstract

To reduce neural network parameter counts and improve sound event detection performance, we propose a multiscale time-frequency convolutional recurrent neural network (MTF-CRNN). Our goal is to recognize target sound events of variable duration against different audio backgrounds while keeping the parameter count low. We exploit four groups of parallel and serial convolutional kernels to learn high-level shift-invariant features from the time and frequency domains of acoustic samples. A two-layer bidirectional gated recurrent unit is used to capture the temporal context of the extracted high-level features. The proposed method is evaluated on two different sound event datasets. As a single model with a low parameter count and no pretraining, it substantially outperforms the baseline method and other methods. On the TUT Rare Sound Events 2017 evaluation dataset, our method achieved an error rate (ER) of 0.09±0.01, a relative improvement of 83% over the baseline. On the TAU Spatial Sound Events 2019 evaluation dataset, our system achieved an ER of 0.11±0.01, a relative improvement of 61% over the baseline, with F1 and ER values better than those on the development dataset. Compared to state-of-the-art methods, our proposed network achieves competitive detection performance with only one-fifth of the network parameter count.
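The ER and F1 figures quoted above are the standard segment-based sound event detection metrics. As a point of reference, a minimal sketch of how they are typically derived from per-segment binary activity (following the usual definition, where per-segment substitutions pair one missed event with one false alarm; the example inputs are illustrative, not taken from the paper):

```python
# Sketch of segment-based SED metrics (ER and F1), assuming the reference
# and prediction are lists of per-segment binary activity vectors:
# one inner list per time segment, one 0/1 entry per event class.
def segment_metrics(reference, prediction):
    tp = fp = fn = n_ref = 0
    subs = dels = ins = 0
    for ref_seg, pred_seg in zip(reference, prediction):
        seg_tp = sum(1 for r, p in zip(ref_seg, pred_seg) if r and p)
        seg_fp = sum(1 for r, p in zip(ref_seg, pred_seg) if not r and p)
        seg_fn = sum(1 for r, p in zip(ref_seg, pred_seg) if r and not p)
        tp += seg_tp; fp += seg_fp; fn += seg_fn
        n_ref += sum(ref_seg)
        # Within a segment, substitutions pair one miss with one false
        # alarm; the leftovers count as deletions / insertions.
        seg_subs = min(seg_fn, seg_fp)
        subs += seg_subs
        dels += seg_fn - seg_subs
        ins += seg_fp - seg_subs
    er = (subs + dels + ins) / max(n_ref, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return er, f1
```

For example, with three segments and two classes, `segment_metrics([[1, 0], [1, 1], [0, 1]], [[1, 0], [0, 1], [0, 1]])` has one deletion against four active reference entries, giving an ER of 0.25.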

Highlights

  • Sound event detection (SED) recognizes a target sound and detects the onset and offset times in an audio recording

  • The 3×3 convolutional recurrent neural network (CRNN) performs better than the other STF-CRNNs, likely because its kernel scale is appropriate for the dataset

  • Because SED is only a subtask of DCASE2019 Task 3, the SED performance of CE-CRNN is slightly worse than that of Twostage [50]


Summary

Introduction

Sound event detection (SED) recognizes a target sound and detects its onset and offset times in an audio recording. Detecting sounds such as gunshots, crying babies, falls, malfunctioning machinery, and endangered animal calls allows us to respond appropriately [1]. Traditional sound event detection methods mainly include signal analysis, information entropy [5], statistical analysis, and clustering [6], [7]. Research on SED has shifted from Gaussian mixture model-hidden Markov models (GMM-HMMs) and support vector machines (SVMs) [8] to deep neural networks (DNNs) [9], convolutional neural networks (CNNs), recurrent neural networks (RNNs), and convolutional recurrent neural networks (CRNNs).
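To make the multiscale time-frequency idea concrete, the sketch below runs convolution kernels of different shapes over a spectrogram in parallel and stacks the resulting feature maps. The kernel shapes (1×3, 3×1, 3×3) and the naive loop-based convolution are illustrative assumptions for exposition, not the paper's exact configuration:

```python
# Hypothetical numpy sketch of multiscale time-frequency feature
# extraction: parallel kernels of different shapes over one spectrogram.
import numpy as np

def conv2d_same(x, kernel):
    """Naive 'same'-padded 2D cross-correlation (loop-based, for clarity)."""
    kh, kw = kernel.shape
    pad = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(pad[i:i + kh, j:j + kw] * kernel)
    return out

def multiscale_features(spectrogram, kernels):
    # One feature map per kernel scale, stacked along a new channel axis.
    return np.stack([conv2d_same(spectrogram, k) for k in kernels])

rng = np.random.default_rng(0)
spec = rng.standard_normal((40, 100))   # (mel bins, time frames)
kernels = [np.ones((1, 3)) / 3,         # time-oriented scale
           np.ones((3, 1)) / 3,         # frequency-oriented scale
           np.ones((3, 3)) / 9]         # joint time-frequency scale
feats = multiscale_features(spec, kernels)
# feats has one channel per scale, each the same size as the input
```

In a full CRNN, feature maps like these would feed a recurrent layer (the paper uses a two-layer bidirectional GRU) to model temporal context.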


