Abstract

Environmental sound feature extraction and classification are important signal analysis tools in many applications, such as audio surveillance, multimedia retrieval, and auditory source identification. However, the non-stationarity and discontinuity of environmental signals make quantification and classification a formidable challenge. Hence, researchers proposed to use the time-frequency image representation to quantify these non-stationarity, resulting in higher classification accuracy. In this paper, a time-frequency representation method is proposed to represent environmental sound signals. Our approach consists of three stages: Firstly, we propose an adaptive variational modal decomposition (AVMD) based on central angular frequency difference to decompose environmental sounds into a series of modes. Secondly, we use the pseudo Wigner-Vile distribution (PWVD) to accurately obtain the instantaneous frequency characteristics of mode signals. Thirdly, time-frequency images of sound signals are obtained by combining the mode signals with PWVD. Finally, we put the time-frequency image into a convolutional neural network (CNN) for classification. The method is tested on the Real World Computing Partnership (RWCP) Sound Scene Database of 50 classes in mismatched conditions. Results show that our method is robust to noise and achieves the best average recognition accuracy compared with several state-of-art methods under clean and various noisy conditions.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call