Abstract

For intelligent systems to make best use of the audio modality, it is important that they can recognize not just speech and music, which have been researched as specific tasks, but also general sounds in everyday environments. To stimulate research in this field we conducted a public research challenge: the IEEE Audio and Acoustic Signal Processing Technical Committee challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). In this paper, we report on the state of the art in automatically classifying audio scenes, and automatically detecting and classifying audio events. We survey prior work as well as the state of the art represented by the submissions to the challenge from various research groups. We also provide detail on the organization of the challenge, so that our experience as challenge hosts may be useful to those organizing challenges in similar domains. We created new audio datasets and baseline systems for the challenge; these, as well as some submitted systems, are publicly available under open licenses, to serve as benchmarks for further research in general-purpose machine listening.

Highlights

  • Ever since advances in automatic speech recognition (ASR) were consolidated into working industrial systems [1], the prospect of algorithms that can describe, catalogue and interpret all manner of sounds has seemed close at hand

  • The baseline system achieved an accuracy of 55%; most systems were able to improve on this, but our significance tests demonstrated a significant improvement over the baseline only for the strongest four systems

  • The results indicate that the level of difficulty of the task was appropriate: the leading systems were able to improve significantly upon the baseline, yet the task was far from trivial for any of the submitted systems

Introduction

Ever since advances in automatic speech recognition (ASR) were consolidated into working industrial systems [1], the prospect of algorithms that can describe, catalogue and interpret all manner of sounds has seemed close at hand. One strategy for acoustic scene classification applies classifiers directly to low-level frame features such as Mel-frequency cepstral coefficients (MFCCs). The other strategy is to use an intermediate representation prior to classification that models the scene using a set of higher-level features, usually captured by a vocabulary or dictionary of "acoustic atoms". These atoms typically represent acoustic events or streams within the scene which are not necessarily known a priori and are learned in an unsupervised manner from the data. An example is the use of non-negative matrix factorization (NMF) to extract bases that are subsequently converted into MFCCs for compactness and used to classify a dataset of train station scenes [15]. Building upon this approach, the authors in [16] used shift-invariant probabilistic latent component analysis (SIPLCA) with temporal constraints via hidden Markov models (HMMs), which led to improved performance. In [17], a system is proposed that uses the matching pursuit algorithm to obtain an effective time-frequency feature selection; these features are afterwards used to supplement MFCCs for environmental sound classification.
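
To make the "acoustic atoms" strategy concrete, below is a minimal sketch of an NMF-based intermediate representation, assuming the librosa and scikit-learn libraries; the function name nmf_scene_features is illustrative and not taken from any submitted system. Note that the system in [15] further converted the learned bases into MFCCs for compactness, whereas this sketch simply summarizes the atom activations.

```python
# Minimal sketch of the "acoustic atoms" strategy: learn spectral bases with
# NMF and summarize their activations as scene-level features.
# Assumes librosa and scikit-learn; this is an illustrative pipeline, not the
# exact method of [15], which converted the learned bases into MFCCs.
import numpy as np
import librosa
from sklearn.decomposition import NMF

def nmf_scene_features(path, n_atoms=20, n_fft=1024, hop_length=512):
    """Return a fixed-length feature vector for one audio scene."""
    y, sr = librosa.load(path, sr=None, mono=True)
    # Magnitude spectrogram V (freq x time), the matrix to factorize.
    V = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    # V ~= W @ H: columns of W are the learned "acoustic atoms" (spectral
    # bases); rows of H are their activations over time.
    model = NMF(n_components=n_atoms, init="nndsvda", max_iter=400,
                random_state=0)
    W = model.fit_transform(V)   # shape (freq, n_atoms)
    H = model.components_        # shape (n_atoms, time)
    # Summarize each atom's activation statistics into one scene descriptor.
    return np.concatenate([H.mean(axis=1), H.std(axis=1)])
```

A scene classifier (e.g. an SVM) would then be trained on one such fixed-length vector per recording, rather than on the raw frame-level features.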
