Abstract

One of the most amazing functions of the human auditory system is its ability to detect all kinds of sound events in the environment. With advances in technology and hardware, polyphonic Sound Event Detection (SED) systems can be developed to mimic this ability. However, developing an SED system is no trivial task, and several factors often hinder accuracy. Although several overview papers are available, most provide only a theoretical overview of the algorithms used, with little discussion. Thus, to the best of the authors' knowledge, there is no comprehensive review that covers this particular domain. Therefore, this paper aims to provide an in-depth discussion of the methodologies proposed by various authors, including the features used, the detection algorithms, and their corresponding accuracy and limitations. Possible trends that may be useful for future development work are also discussed.

Highlights

  • The auditory system can be considered as one of the most amazing functional groups in the human body

  • Compared to the earlier work [91], the main differences in [92] are: 1) the first layer of the Convolutional Neural Network (CNN) used to extract features from log mel energies and GCC-PHAT is a 3D CNN, 2) the bidirectional LSTM is replaced with a bidirectional Gated Recurrent Unit (GRU), and 3) early stopping is based on accuracy improvement over 100 epochs instead of 50

  • In this paper, different Sound Event Detection (SED) methodologies proposed in the literature were reviewed and discussed in detail


Summary

INTRODUCTION

The auditory system can be considered as one of the most amazing functional groups in the human body. Overlapping events can be detected as long as the probabilities in a given frame exceed the fixed threshold. Based on this methodology, Bisot et al. [50] reported a one-second segment-based F1-score of 49.5 with an ER of 69.5 when tested on the TUT-SED 2016 development dataset. Ohishi et al. [51] modeled overlapping sound events using NMF, integrating the Markov Indian Buffet Process (mIBP) and the Chinese Restaurant Process (CRP), and proposed Bayesian logistic regression to estimate the audio event labels from the activation matrix. This technique was subsequently tested on an English learning podcast, and the authors [51] reported an accuracy (in terms of Area Under Curve (AUC)) of 0.79, outperforming a baseline GMM and three other variants of the proposed method. A summary section is provided at the end of each subsection to conclude the findings of each method.
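
To make the thresholding step and the two reported metrics concrete, the following is a minimal sketch and not the implementation used in [50]. It assumes that class probabilities have already been aggregated to one row per one-second segment, that the threshold is 0.5, and that F1 and ER follow the segment-based definitions commonly used in the DCASE evaluations (substitutions, deletions, and insertions counted per segment and normalized by the number of active reference events). The function names and the toy data are hypothetical.

```python
def binarize(probs, threshold=0.5):
    """Mark a class as active in a segment when its probability exceeds the threshold.

    Several classes can exceed the threshold in the same segment,
    which is how overlapping (polyphonic) events are detected.
    """
    return [[1 if p > threshold else 0 for p in segment] for segment in probs]


def segment_metrics(reference, predicted):
    """Return (F1, ER) computed over segments (assumed DCASE-style definitions)."""
    tp = fp = fn = 0
    subs = dels = ins = n_ref = 0
    for ref_seg, pred_seg in zip(reference, predicted):
        seg_tp = sum(1 for r, p in zip(ref_seg, pred_seg) if r and p)
        seg_fp = sum(1 for r, p in zip(ref_seg, pred_seg) if p and not r)
        seg_fn = sum(1 for r, p in zip(ref_seg, pred_seg) if r and not p)
        tp, fp, fn = tp + seg_tp, fp + seg_fp, fn + seg_fn
        # A false positive paired with a false negative counts as one substitution.
        subs += min(seg_fn, seg_fp)
        dels += max(0, seg_fn - seg_fp)
        ins += max(0, seg_fp - seg_fn)
        n_ref += sum(ref_seg)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    er = (subs + dels + ins) / n_ref if n_ref else 0.0
    return f1, er


# Toy data (hypothetical): 3 one-second segments, 3 event classes.
probs = [[0.9, 0.7, 0.1],
         [0.2, 0.6, 0.8],
         [0.4, 0.3, 0.1]]
reference = [[1, 1, 0],
             [0, 1, 0],
             [1, 0, 0]]

predicted = binarize(probs)
f1, er = segment_metrics(reference, predicted)
print(predicted, f1, er)
```

In the toy data, two classes are active at once in the first segment, one insertion occurs in the second, and one deletion in the third. A fixed threshold is the simplest choice; it is a tunable parameter, and raising it trades insertions for deletions.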

NON-HYBRID MODELS
PUBLIC DATASET AND EVALUATION METRIC USED BY DIFFERENT AUTHORS
DCASE 2017
EVALUATION METRIC
POSSIBLE FUTURE RESEARCH DIRECTIONS
Findings
CONCLUSION