Abstract

In recent years, with the construction of intelligent cities, research on environmental sound classification (ESC) has become increasingly important. However, because environmental sound is non-stationary and ambient noise interferes strongly, the recognition accuracy of ESC remains limited. Even with deep learning methods, a model with a single input struggles to extract features fully. To improve the recognition accuracy of ESC, this paper proposes a two-stream convolutional neural network (CNN) based on a raw audio CNN (RACNN) and a logmel CNN (LMCNN). In this method, a pre-emphasis module is first constructed to process the raw audio signal. The processed audio data and the logmel data are fed into RACNN and LMCNN, respectively, to obtain both time and frequency features of the audio. In addition, a random-padding method is proposed to patch shorter data sequences, which greatly increases the data available for experiments. Finally, the effectiveness of the methods is verified experimentally on the UrbanSound8K dataset.
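The abstract does not spell out how the pre-emphasis module or the random-padding method is built. As a minimal sketch, assuming a classic first-order pre-emphasis filter and assuming that random padding means placing a short clip at a random offset inside a fixed-length zero buffer, the two steps could look like the following. The function names, the coefficient 0.97, and the zero fill are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1].

    coeff=0.97 is a common default, assumed here; the paper's value is not given.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size == 0:
        return signal
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def random_pad(signal, target_len, rng=None):
    """Place a short clip at a random offset inside a zero buffer (assumed scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size >= target_len:
        # Clips that are already long enough are simply truncated.
        return signal[:target_len].copy()
    # Any offset that keeps the whole clip inside the buffer is allowed.
    offset = rng.integers(0, target_len - signal.size + 1)
    padded = np.zeros(target_len)
    padded[offset:offset + signal.size] = signal
    return padded
```

Under this reading, calling `random_pad` several times on the same short clip yields differently aligned fixed-length examples, which is one way short recordings could become usable training data.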

Highlights

  • Speech recognition technology, as one of the representatives of the new generation of information technology, has become more and more mature

  • A two-stream convolutional neural network (CNN) model based on deep learning is proposed to improve the accuracy of environmental sound classification (ESC). Both time-domain and frequency-domain features of the audio signal are used as inputs, and a pre-emphasis module is constructed at the input layer to improve the signal-to-noise ratio (SNR)

  • Batch normalization (BN) and global average pooling are used in front of the fully connected layer of each stream CNN to reduce the number of parameters; these layers are not marked in the figure
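To illustrate why pooling in front of the fully connected layer reduces the parameter count, the sketch below compares a dense layer fed by flattened feature maps with one fed by globally averaged feature maps. The feature-map size (256 channels of 8 x 8) is a hypothetical example, not a figure from the paper; the 10 output classes match UrbanSound8K. The saving shown comes from the pooling step alone.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each channel over its spatial axes: (C, H, W) -> (C,)."""
    return np.asarray(feature_maps, dtype=np.float64).mean(axis=(1, 2))

def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

# Hypothetical final feature-map size; only n_classes reflects the dataset.
channels, height, width, n_classes = 256, 8, 8, 10

# Dense layer on the flattened maps: one weight per spatial position per class.
flatten_params = dense_params(channels * height * width, n_classes)

# Dense layer on the pooled vector: one weight per channel per class.
gap_params = dense_params(channels, n_classes)

print(flatten_params, gap_params)  # 163850 2570
```

With these assumed sizes, the pooled variant needs roughly 64 times fewer parameters in the final layer, since each 8 x 8 map collapses to a single value.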


Summary

INTRODUCTION

Speech recognition technology, as one of the representatives of the new generation of information technology, has become more and more mature. ESC research has mainly relied on manually extracted features, such as the Mel-frequency cepstral coefficient (MFCC), the linear predicted cepstral coefficient (LPCC), short-term energy and zero-crossing rate [8]. A two-stream convolutional neural network (CNN) model based on deep learning is proposed to improve the accuracy of ESC. In this method, both time-domain and frequency-domain features of the audio signal are used as inputs, and a pre-emphasis module is constructed at the input layer to improve the signal-to-noise ratio (SNR).
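The frequency-domain input of such a two-stream model is a log-mel representation of the same clip whose raw waveform feeds the time-domain stream. The summary gives no extraction settings, so the sketch below is only one plausible NumPy-only version; the sample rate, FFT size, hop length, number of mel bands and the mel-scale formula are all assumptions.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # Each filter is a triangle peaking at its center frequency.
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel(signal, sample_rate=22050, n_fft=1024, hop=512, n_mels=40):
    """Log-mel spectrogram of a 1-D signal, shape (n_frames, n_mels)."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < n_fft:
        raise ValueError("signal must contain at least n_fft samples")
    n_frames = 1 + (signal.size - n_fft) // hop
    # Build overlapping frames by indexing, then apply a Hann window.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_power = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    # A small constant keeps the logarithm finite for silent bands.
    return np.log(mel_power + 1e-10)
```

For a one-second clip at the assumed 22050 Hz, `log_mel` returns a 42 x 40 array that a frequency-domain stream could consume, while the (pre-emphasized) waveform itself would go to the time-domain stream.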

RELATED WORK
EXPERIMENT
Findings
CONCLUSION