Abstract

In recent years, deep learning has permeated not only computer vision and speech recognition research but also fields such as acoustic event detection (AED). One of the aims of AED is to detect and classify non-speech acoustic events occurring in conversational scenes, including those produced by both humans and the objects that surround us. In AED, deep learning has enabled modeling of detail-rich features; among these, high-resolution spectrograms have shown a significant advantage over existing predefined features (e.g., Mel-filter banks) that compress and reduce detail. In this paper, we further assess the importance of feature extraction for deep learning-based acoustic event detection. AED based on spectrogram-input deep neural networks exploits the fact that sounds have “global” spectral patterns, but sounds also have “local” properties, such as being more transient or smoother in the time-frequency domain. These properties can be exposed by adjusting the time-frequency resolution used to compute the spectrogram, or by using a model that exploits locality. This leads us to explore two feature extraction strategies in the context of deep learning: (1) using multiple-resolution spectrograms simultaneously and analyzing the overall and event-wise influence to combine the results, and (2) introducing convolutional neural networks (CNNs), a state-of-the-art 2D feature extraction model that exploits local structure, with log power spectrogram input for AED. An experimental evaluation shows that the approaches we describe outperform our state-of-the-art deep learning baseline, with a noticeable gain in the CNN case, and provides insights into CNN-based spectrogram characterization for AED.
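As an illustrative sketch of the first strategy (not the paper's exact pipeline), the following Python code computes log power spectrograms of the same signal at several time-frequency resolutions; the window lengths, hop sizes, and synthetic input are assumptions chosen only for demonstration. Short windows emphasize transient structure, while long windows trade time detail for sharper frequency detail.

import numpy as np
from scipy.signal import stft

def log_power_spectrogram(x, fs, win_len, hop_len):
    """Log power spectrogram using a Hann window of win_len samples."""
    _, _, Z = stft(x, fs=fs, window="hann",
                   nperseg=win_len, noverlap=win_len - hop_len)
    power = np.abs(Z) ** 2
    return np.log(power + 1e-10)  # small floor avoids log(0)

fs = 16000                           # assumed sampling rate
x = np.random.randn(fs)              # 1 s of dummy audio as a stand-in for a real recording
# Several assumed window lengths: short windows favor transient events,
# long windows favor smoother, frequency-detailed spectra.
resolutions = {"short": 256, "medium": 512, "long": 1024}
spectrograms = {name: log_power_spectrogram(x, fs, n, n // 2)
                for name, n in resolutions.items()}
for name, S in spectrograms.items():
    print(name, S.shape)             # (frequency bins, time frames) per resolution

The resulting spectrograms could then be fed to a classifier jointly or combined at the decision level, which is the kind of combination the abstract refers to.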

Highlights

  • In the context of conversational scene understanding, most research is directed towards the goal of automatic speech recognition (ASR), because speech is arguably the most informative sound in acoustic scenes

  • We have described two approaches that deal with the importance of feature extraction in deep learning-based acoustic event detection (AED)

  • Both approaches highlight the superiority of high-resolution spectrogram patches as model input, thanks to the ability of deep neural networks (DNNs) to model high-dimensional data

Summary

Introduction

In the context of conversational scene understanding, most research is directed towards automatic speech recognition (ASR), because speech is arguably the most informative sound in acoustic scenes. Non-speech acoustic signals, however, provide cues that make us aware of the environment, and while most of our attention might be dedicated to actual speech, “non-speech” information is critical if we are to achieve a complete understanding of every situation we face. Speakers take this information for granted, actively or passively omitting concepts that can be inferred from their location, the current activity, or events occurring in the same scene. AED applications range from rich transcription in speech communication [3, 4] to scene understanding.
