Abstract

There are two major questions in Environmental Sound Classification (ESC): what is the best audio recognition framework, and what is the most robust audio feature? To investigate these questions, this paper uses a Gated Recurrent Unit (GRU) network to analyze the effect of single features, namely the Mel-scale spectrogram (Mel), log-Mel-scale spectrogram (LM), and Mel-frequency cepstral coefficients (MFCC), as well as the multi-features Mel-MFCC, LM-MFCC, and Mel-LM-MFCC (T-M). The experimental results show that in ESC tasks, multi-features outperform single features of the same dimensionality, and LM-MFCC is the most robust. In addition, reverse-sequence MFCC (R-MFCC) and mixed forward-and-reverse-sequence MFCC (FR-MFCC) are proposed to study the effect of sequence changes on audio; the results show that such sequence transformations of audio features have little influence on recognition. To investigate the ESC task further, we introduce the attention weight similar (AWS) model into the multi-feature setting. The AWS model allows the attention weights of different audio features of the same sound to learn from each other, enabling the GRU-AWS model to focus on frame-level features more effectively. The experimental results show that GRU-AWS achieves excellent performance with a recognition rate of 94.3%, outperforming other state-of-the-art methods.
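To make the feature pipeline described above concrete, the following is a minimal illustrative sketch (not the authors' released code or the AWS model) of how the single features Mel, LM, and MFCC can be extracted and concatenated into a multi-feature such as LM-MFCC, then fed to a plain GRU classifier. All hyperparameters (n_mels, n_mfcc, hop length, hidden size, number of classes) and the file name "example.wav" are assumptions for illustration only.

```python
# Illustrative sketch: Mel / log-Mel / MFCC extraction, LM-MFCC multi-feature,
# and a baseline GRU classifier (without the AWS attention mechanism).
import librosa
import numpy as np
import torch
import torch.nn as nn

def extract_lm_mfcc(path, sr=22050, n_mels=40, n_mfcc=40, hop_length=512):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)   # Mel
    log_mel = librosa.power_to_db(mel)                             # LM
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=hop_length)             # MFCC
    # Multi-feature: concatenate along the feature axis (LM-MFCC)
    lm_mfcc = np.concatenate([log_mel, mfcc], axis=0)              # (80, frames)
    return lm_mfcc.T                                               # (frames, 80)

class GRUClassifier(nn.Module):
    """Plain GRU over frame-level features; class counts are assumed."""
    def __init__(self, feat_dim=80, hidden=128, n_classes=10):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        _, h = self.gru(x)                 # h: (1, batch, hidden)
        return self.fc(h[-1])              # class logits

feats = torch.tensor(extract_lm_mfcc("example.wav"), dtype=torch.float32)
logits = GRUClassifier()(feats.unsqueeze(0))
```

The reversed-sequence variants (R-MFCC, FR-MFCC) discussed in the abstract correspond to reversing or mixing the frame order of such feature matrices before they are passed to the GRU.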
