Sound event detection (SED) and acoustic scene classification (ASC) are central tasks in the computational analysis of environmental sound scenes, with broad application potential in areas such as urban environment monitoring, smart home integration, and anomaly detection. Given the intricate relationship between acoustic scenes and sound events, researchers have proposed jointly analyzing the SED and ASC tasks through multi-task learning (MTL). However, conventional MTL-based methods rely mainly on a hard parameter-sharing mechanism, which restricts information flow and collaborative learning between tasks during training. To address this issue, this paper proposes a novel MTL method for SED that employs a soft parameter-sharing mechanism to exploit the intrinsic correlation between acoustic scenes and sound events. Specifically, a cross-stitch mechanism enables soft parameter sharing between the SED and ASC branches, allowing scene information to improve SED performance while preserving training efficiency. Furthermore, a dual-stream attention convolution module (DACM) is designed to capture global and local contextual information through parallel attention and convolution branches. We evaluated the proposed method on two benchmarks: the real-world TUT Acoustic Scenes 2016/2017 and TUT Sound Events 2016/2017 datasets, and a synthetic sound scene dataset. Experimental results show that the proposed MTL method outperforms state-of-the-art approaches, demonstrating that acoustic scene information can improve the performance of the SED task.
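The soft parameter-sharing idea can be illustrated with a standard cross-stitch unit: the activations of the two task branches at a given layer are recombined through a small learnable mixing matrix, so each branch can draw on the other's features instead of sharing weights outright. The sketch below is a minimal NumPy illustration of that mixing step only; the function name `cross_stitch`, the 2x2 matrix `alpha`, and the per-feature-map granularity are assumptions for illustration, not the paper's exact implementation (in practice `alpha` would be a trained parameter).

```python
import numpy as np

def cross_stitch(x_sed, x_asc, alpha):
    """Cross-stitch unit: linearly mix the SED and ASC feature maps.

    x_sed, x_asc : same-shape activation arrays from the two branches.
    alpha        : 2x2 mixing matrix (learnable in a real network);
                   alpha = identity recovers fully separate branches.
    """
    x_sed_new = alpha[0, 0] * x_sed + alpha[0, 1] * x_asc
    x_asc_new = alpha[1, 0] * x_sed + alpha[1, 1] * x_asc
    return x_sed_new, x_asc_new

# With an identity mixing matrix, no information crosses between tasks;
# off-diagonal weights let scene features leak into the SED branch.
x_sed = np.ones(4)
x_asc = np.zeros(4)
mixed_sed, mixed_asc = cross_stitch(x_sed, x_asc, np.array([[0.9, 0.1],
                                                            [0.1, 0.9]]))
```

During training, the network would learn `alpha` jointly with the branch weights, so the degree of sharing is itself optimized rather than fixed in advance as in hard sharing.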
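The dual-stream design can likewise be sketched in its simplest form: a self-attention branch aggregates context across the whole time axis (global), a small convolution branch responds only to a local neighborhood, and the two streams are fused. Everything below is a schematic assumption built from the abstract's one-sentence description — the function name `dual_stream_block`, single-head attention, per-channel 1-D convolution, and additive fusion are illustrative choices, not the paper's DACM architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_stream_block(x, w_q, w_k, w_v, conv_kernel):
    """Toy dual-stream block: global attention + local convolution.

    x           : (T, d) time-by-feature map.
    w_q/w_k/w_v : (d, d) projection matrices for the attention stream.
    conv_kernel : odd-length 1-D kernel for the local stream.
    """
    # Global stream: single-head self-attention over the time axis.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(x.shape[1]))
    global_out = attn @ v
    # Local stream: per-channel 1-D convolution with "same" padding.
    local_out = np.stack(
        [np.convolve(x[:, c], conv_kernel, mode="same") for c in range(x.shape[1])],
        axis=1,
    )
    # Fuse the two streams by summation.
    return global_out + local_out
```

The point of the parallel arrangement is that neither stream bottlenecks the other: attention supplies long-range scene-level context while the convolution preserves fine local event structure.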