Abstract

Convolutional neural networks (CNNs) with log-mel audio representation and CNN-based end-to-end learning have both been used for environmental event sound recognition (ESC). However, log-mel features can be complemented by features learned from the raw audio waveform, given an effective fusion method. In this paper, we first propose a novel stacked CNN model with multiple convolutional layers of decreasing filter sizes to improve the performance of CNN models with either log-mel feature input or raw waveform input. These two models are then combined using Dempster–Shafer (DS) evidence theory to build the ensemble DS-CNN model for ESC. Our experiments on three public datasets show that our method achieves much higher performance in environmental sound recognition than other CNN models with the same types of input features. This is achieved by exploiting the complementarity of the model based on log-mel feature input and the model that learns features directly from raw waveforms.

Highlights

  • In recent years, while research in auditory recognition has often focused on automatic speech recognition (ASR), music classification [1], and acoustic scene classification (ASC), the environmental event sound recognition (ESC) problem has received increasing attention from the research community, with popular applications in audio surveillance systems [2] and noise mitigation [3].

  • We developed a novel ensemble environmental event sound recognition model, DS-CNN, by fusing logmel-CNN and end-to-end raw-CNN models using DS evidence theory to exploit raw waveform features as well as log-mel features.

  • ESC-50 consists of 2000 audio files of environmental sound events in 50 balanced categories.


Summary

Introduction

While research in auditory recognition has often focused on automatic speech recognition (ASR), music classification [1], and acoustic scene classification (ASC), the environmental event sound recognition (ESC) problem has received increasing attention from the research community, with popular applications in audio surveillance systems [2] and noise mitigation [3]. In the ESC (or sound event detection) problem, the goal is to recognize the event type of a specific sound, such as a dog bark, car horn, or engine. These sound events include various daily audio events with chaotic and diverse structure [4] and can be categorized into three groups: single sounds such as a mouse-click, repeated discrete sounds such as clapping hands or typing on a keyboard, and steady continuous sounds such as the sound of a vacuum cleaner or engine [5]. A ‘bus’ scene may be identified from frequently occurring sound events such as acceleration, braking, passenger announcements, and door opening sounds, while the engine and other people’s conversations exist in the background [6].
