Abstract

In recent years, deep learning has not only permeated the computer vision and speech recognition research fields but also fields such as acoustic event detection (AED). One of the aims of AED is to detect and classify non-speech acoustic events occurring in conversation scenes, including those produced by both humans and the objects that surround us. In AED, deep learning has enabled modeling of detail-rich features, and among these, high-resolution spectrograms have shown a significant advantage over existing predefined features (e.g., Mel-filter bank) that compress and reduce detail. In this paper, we further assess the importance of feature extraction for deep learning-based acoustic event detection. AED based on spectrogram-input deep neural networks exploits the fact that sounds have “global” spectral patterns, but sounds also have “local” properties such as being more transient or smoother in the time-frequency domain. These can be exposed by adjusting the time-frequency resolution used to compute the spectrogram, or by using a model that exploits locality, leading us to explore two different feature extraction strategies in the context of deep learning: (1) using multiple resolution spectrograms simultaneously and analyzing the overall and event-wise influence to combine the results, and (2) introducing the use of convolutional neural networks (CNN), a state-of-the-art 2D feature extraction model that exploits local structures, with log power spectrogram input for AED. An experimental evaluation shows that the approaches we describe outperform our state-of-the-art deep learning baseline, with a noticeable gain in the CNN case, and provide insights regarding CNN-based spectrogram characterization for AED.
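As a rough illustration of the first strategy, the sketch below computes log power spectrograms of the same signal at several time-frequency resolutions, so that both transient and smoother sounds are captured well by at least one resolution. It assumes the librosa library for the STFT; the sampling rate, window lengths, and hop size are illustrative choices, not the settings used in the paper.

# Strategy (1) sketch: log power spectrograms at multiple time-frequency resolutions.
# Window lengths, hop size, and sampling rate are illustrative, not the paper's settings.
import numpy as np
import librosa

def multi_resolution_log_spectrograms(wav_path, n_ffts=(256, 512, 1024), hop=128):
    y, sr = librosa.load(wav_path, sr=16000)             # mono signal at 16 kHz
    specs = []
    for n_fft in n_ffts:
        # A short window favors time resolution (transient events);
        # a long window favors frequency resolution (smoother, tonal events).
        stft = librosa.stft(y, n_fft=n_fft, hop_length=hop, win_length=n_fft)
        specs.append(np.log(np.abs(stft) ** 2 + 1e-10))  # log power spectrogram
    return specs, sr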

Highlights

  • In the context of conversational scene understanding, most research is directed towards the goal of automatic speech recognition (ASR), because speech is arguably the most informative sound in acoustic scenes

  • We have described two approaches that deal with the importance of feature extraction in deep learning-based acoustic event detection (AED)

  • Both models highlight the superiority of using high-resolution spectrogram patches as input to the models, thanks to deep neural networks (DNNs) and their ability to model high-dimensional data (a CNN sketch over such patches follows this list)
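As a rough illustration of the second strategy, the sketch below defines a small convolutional network that classifies fixed-size log power spectrogram patches; its convolution and pooling layers operate on local time-frequency structure. It assumes PyTorch, and the patch size (128 frequency bins by 32 frames), layer sizes, and number of event classes are illustrative assumptions rather than the paper's configuration.

# Strategy (2) sketch: a small CNN classifying log power spectrogram patches.
# Patch size, layer sizes, and number of event classes are illustrative assumptions.
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # local time-frequency filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pool over time and frequency
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 32 * 8, 128),                 # matches 128x32 input patches
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):                                # x: (batch, 1, 128 bins, 32 frames)
        return self.classifier(self.features(x))

# Example: a batch of 4 random patches yields one logit per event class for each patch.
logits = SpectrogramCNN()(torch.randn(4, 1, 128, 32))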


Summary

Introduction

In the context of conversational scene understanding, most research is directed towards the goal of automatic speech recognition (ASR), because speech is arguably the most informative sound in acoustic scenes. Non-speech acoustic signals provide cues that make us aware of the environment, and while most of our attention might be dedicated to actual speech, “non-speech” information is critical if we are to achieve a complete understanding of each and every situation we face. This information is implied by the speakers, who therefore actively or passively omit mentioning certain concepts that can be inferred from their location, the current activity, or events occurring in the same scene. AED applications range from rich transcription in speech communication [3, 4] and scene understanding

