Abstract

With the popularity of deep learning-based models in various categorization problems and their proven robustness compared to conventional methods, a growing number of researchers have applied such methods to environmental sound classification (ESC) tasks in recent years. However, the performance of existing models that use auditory features such as the log-mel spectrogram (LM) and mel-frequency cepstral coefficients (MFCC), or the raw waveform, to train deep neural networks for ESC is unsatisfactory. In this paper, we first propose two combined features to give a more comprehensive representation of environmental sounds. Then, a four-layer convolutional neural network (CNN) is presented to improve the performance of ESC with the proposed aggregated features. Finally, the CNNs trained with the different features are fused using the Dempster–Shafer evidence theory to compose the TSCNN-DS model. The experimental results indicate that our combined features with the four-layer CNN are well suited to environmental sound classification problems and dramatically outperform other conventional methods. The proposed TSCNN-DS model achieves a classification accuracy of 97.2%, the highest reported on the UrbanSound8K dataset compared with existing models.
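The aggregation of auditory features described above can be illustrated with a minimal sketch. The array shapes below (60 frequency bins by 41 frames) and the channel-stacking layout are assumptions for illustration only, not the paper's exact configuration; in practice the LM and MFCC matrices would come from a library such as librosa rather than random data.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for features extracted from one audio clip
# (assumed shapes: 60 bins x 41 frames)
log_mel = rng.standard_normal((60, 41))  # log-mel spectrogram
mfcc = rng.standard_normal((60, 41))     # mel-frequency cepstral coefficients

# aggregated feature: stack the two views as channels of a single
# image-like input, which a CNN can then consume directly
lm_mfcc = np.stack([log_mel, mfcc], axis=0)
print(lm_mfcc.shape)  # (2, 60, 41)
```

Stacking along a channel axis (rather than concatenating along frequency) lets the first convolutional layer learn filters that see both representations of the same time-frequency region at once.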

Highlights

  • Intelligent sound recognition (ISR) is a technology for identifying sound events that exist in the real environment

  • We can imagine that systems developed for automatic speech recognition (ASR) and music information retrieval (MIR) will be inefficient when applied to environmental sound classification (ESC) tasks

  • We propose the TSCNN-DS model for intelligent sound recognition problems



Introduction

Intelligent sound recognition (ISR) is a technology for identifying sound events that occur in the real environment. Several works propose merged neural networks to address the above-mentioned shortcomings by integrating information from earlier steps [17,18,19,20]. In these methods, one or more CNNs first extract spatial information from different acoustic features. Experimental results indicate that merged neural networks with decision-level fusion outperform single deep architectures in classification tasks [20,21,22,23]. In recent years, with the advancement of deep learning, the CNN has become a primary choice for environmental sound recognition, outperforming conventional classifiers such as the SVM or GMM [7,24].

