Abstract

Recently, excellent performance has been achieved in Acoustic Scene Classification (ASC) using Convolutional Neural Networks (CNNs) with Mel spectrogram feature representations. The Mel spectrogram feature is attracting increasing attention for its effectiveness in improving performance. In this paper, Gradient-weighted Class Activation Mapping (Grad-CAM), a CNN visualization technique, is used to evaluate what information a CNN perceives. The importance of regions in the Mel spectrogram varies significantly for a trained CNN: some areas are strongly activated, while others are not. Because the whole Mel spectrogram contains a large amount of information, some of it goes unused when the entire spectrogram is fed into a CNN at once, which leaves room to improve the feature utilization of the Mel spectrogram. This paper proposes a method based on spectrogram decomposition and model merging that makes local features more prominent and the CNN easier to train. Specifically, the whole Mel spectrogram is segmented along the time and frequency dimensions to generate multiple sub-spectrograms. Sub-spectrograms in the same frequency bins share the same CNN sub-model, and the prediction for the whole Mel spectrogram is obtained by merging the outputs of the CNN sub-models. Experimental results show that the proposed algorithm outperforms existing systems by 5.64%. The confusion matrices and class activation maps further demonstrate the effectiveness of Mel spectrogram decomposition.
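The decompose-and-merge pipeline described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the band and segment counts, the stand-in "models", and the averaging merge are all assumptions made for the example.

```python
import numpy as np

def decompose(mel, n_freq_bands=2, n_time_segments=4):
    """Split a Mel spectrogram (n_mels x n_frames) into sub-spectrograms
    along the frequency axis first, then the time axis."""
    bands = np.array_split(mel, n_freq_bands, axis=0)      # frequency dimension
    return [np.array_split(b, n_time_segments, axis=1)     # time dimension
            for b in bands]                                # indexed [band][segment]

def predict(mel, band_models, n_time_segments=4):
    """Sub-spectrograms in the same frequency band share one sub-model;
    the whole-spectrogram prediction merges (here: averages) all outputs."""
    subs = decompose(mel, len(band_models), n_time_segments)
    outputs = [band_models[i](seg)
               for i, segments in enumerate(subs)
               for seg in segments]
    return np.mean(outputs, axis=0)

# Toy usage: two stand-in "sub-models" returning fixed 2-class scores.
mel = np.random.rand(128, 400)                 # 128 Mel bins, 400 frames
models = [lambda s: np.array([0.8, 0.2]),      # low-frequency band model
          lambda s: np.array([0.4, 0.6])]      # high-frequency band model
print(predict(mel, models))                    # merged class scores: [0.6 0.4]
```

In practice the sub-models would be trained CNNs and the merge could be a learned combination rather than a plain average; the sketch only shows the data flow of decomposition and merging.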
