UrbanSound8K Dataset Research Articles

For environmental sound classification (ESC), this letter presents a learnable auditory filterbank based on a one-dimensional (1D) convolutional neural network with a strong psychophysiological inductive bias in the form of a gammatone filterbank and an equal-loudness prompting normalization. Several earlier ESC methods learn auditory features by applying plain 1D convolutions to raw input waveforms, with the aim of outperforming traditional handcrafted features such as a mel-frequency filterbank. However, the large number of parameters involved in these convolutions suggests that such methods will not generalize better than a model defined by fewer parameters, which is the kind of model considered in this letter. Here, a learnable gammatone filterbank layer, consisting of 1D kernels represented by a parametric form of the bandpass gammatone filters, is proposed for acquiring a time-frequency representation of the raw waveform. A normalization with learnable parameters that control the trade-off between energy equalization and structure preservation in the spectro-temporal domain is also proposed. To verify the effectiveness of the considered network and the normalization, ESC experiments were conducted on the ESC-50 and UrbanSound8K datasets. The considered network performed better on both datasets than other state-of-the-art networks. In addition, an ensemble architecture achieved a further performance improvement.
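
To make the idea of a parametric gammatone kernel concrete, the following is a minimal sketch, not the letter's implementation. It samples the textbook gammatone impulse response g(t) = t^(n-1) exp(-2πbt) cos(2πft) and applies it as an unpadded 1D convolution. The filter order of 4, the ERB bandwidth formula, the 1.019 bandwidth scale, the kernel length, and all function names are assumptions chosen for illustration; in a learnable layer the center frequency and bandwidth would be trainable parameters rather than fixed values.

```python
import math


def erb_bandwidth(center_hz):
    """Equivalent rectangular bandwidth in Hz (a common approximation; assumed here)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)


def gammatone_kernel(center_hz, sample_rate, length=256, order=4, bandwidth_scale=1.019):
    """Sample g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*f*t), scaled to unit peak."""
    b = bandwidth_scale * erb_bandwidth(center_hz)
    taps = []
    for k in range(length):
        t = k / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        taps.append(envelope * math.cos(2.0 * math.pi * center_hz * t))
    peak = max(abs(v) for v in taps) or 1.0
    return [v / peak for v in taps]


def band_energy(signal, kernel):
    """Mean squared output of an unpadded ('valid') 1D convolution of signal with kernel."""
    m = len(kernel)
    flipped = kernel[::-1]  # true convolution flips the kernel
    outputs = [
        sum(signal[i + j] * flipped[j] for j in range(m))
        for i in range(len(signal) - m + 1)
    ]
    return sum(v * v for v in outputs) / len(outputs)


def filterbank_energies(signal, center_freqs, sample_rate, length=256):
    """One energy value per band: a single frame of a time-frequency representation."""
    return [
        band_energy(signal, gammatone_kernel(f, sample_rate, length))
        for f in center_freqs
    ]


if __name__ == "__main__":
    sr = 16000
    tone = [math.sin(2.0 * math.pi * 1000.0 * k / sr) for k in range(512)]
    print(filterbank_energies(tone, [500.0, 1000.0, 4000.0], sr))
```

Each kernel here is described by two numbers (center frequency and bandwidth) instead of one free weight per tap, which is the parameter saving the abstract argues for.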

The analysis of sound information is helpful for audio surveillance, multimedia information retrieval, audio tagging, and forensic applications. Environmental audio scene recognition (EASR) and sound event recognition (SER) for audio surveillance are challenging tasks due to the presence of multiple sound sources, background noise, and overlapping or polyphonic contexts. We focus on learning robust and compact representations for environmental audio scenes and sound events using mel-frequency cepstral coefficients as basic features, which have proved to be effective in speech and audio-related tasks. In this paper, we propose a common hybrid model-based framework that learns representations with the help of generative models. We explore instance-specific adapted Gaussian mixture models for environmental audio scenes and instance-specific hidden Markov models for sound events to compute robust, compact, and discriminative representations. A discriminative model-based classifier is then used to recognize these representations as environmental audio scenes and sound events. The performance of the proposed approaches is evaluated using the DCASE2013 scene dataset and the TUT-DCASE2016 scene dataset for the EASR task. The Environmental Sound Classification (ESC-10) and UrbanSound8K datasets are used for the SER task. The recognition accuracy of the proposed framework is significantly better than that of many state-of-the-art approaches proposed in the recent literature. The discriminative nature of the model-driven representations leads to improved efficiency for the EASR and SER tasks. The proposed approaches are more suitable for tasks with less training data.
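
One common way to obtain an instance-specific adapted Gaussian mixture model is MAP adaptation of the component means of a shared background model, followed by stacking the adapted means into a fixed-length vector. The sketch below illustrates that general technique only; the abstract does not say which adaptation scheme the paper uses, so the mean-only MAP update, the diagonal covariances, the relevance factor of 16, and all function names are assumptions. Frames stand in for MFCC vectors.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Shift background-model means toward one recording's frames (mean-only MAP)."""
    k, d = len(weights), len(means[0])
    counts = [0.0] * k
    sums = [[0.0] * d for _ in range(k)]
    for x in frames:
        logp = [
            math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        top = max(logp)  # subtract the max for numerical stability
        post = [math.exp(lp - top) for lp in logp]
        z = sum(post)
        for c in range(k):
            r = post[c] / z  # responsibility of component c for this frame
            counts[c] += r
            for j in range(d):
                sums[c][j] += r * x[j]
    adapted = []
    for c in range(k):
        if counts[c] > 0.0:
            alpha = counts[c] / (counts[c] + relevance)
            data_mean = [s / counts[c] for s in sums[c]]
        else:
            alpha, data_mean = 0.0, list(means[c])
        adapted.append(
            [alpha * e + (1.0 - alpha) * m for e, m in zip(data_mean, means[c])]
        )
    return adapted


def supervector(adapted_means):
    """Concatenate adapted means into one fixed-length representation."""
    return [v for mean in adapted_means for v in mean]


if __name__ == "__main__":
    # Toy 1-D background model with two components; real frames would be MFCC vectors.
    frames = [[1.0]] * 20
    adapted = map_adapt_means(frames, [0.5, 0.5], [[0.0], [10.0]], [[1.0], [1.0]])
    print(supervector(adapted))
```

The resulting vector has the same length for every recording regardless of its duration, which is what allows a discriminative classifier to consume it directly.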

Articles published on UrbanSound8K Dataset

Interpretable Representation Learning for Speech and Audio Signals Based on Relevance Weighting

CNN-Based Learnable Gammatone Filterbank and Equal-Loudness Normalization for Environmental Sound Classification

Visual Object Detector for Cow Sound Event Detection

Environment Sound Event Classification With a Two-Stream Convolutional Neural Network

A New Deep CNN Model for Environmental Sound Classification

Performance analysis of multiple aggregated acoustic features for environment sound classification

Generative Model Driven Representation Learning in a Hybrid Framework for Environmental Audio Scene and Sound Event Recognition

End-to-end environmental sound classification using a 1D convolutional neural network

Environment Sound Classification Using a Two-Stream CNN Based on Decision-Level Fusion

Spectro-temporal features for environmental sound classification

Spectro-temporal features for environmental sound classification

Audio Event Classification Using Deep Neural Networks
