Abstract
Acoustic scene classification is vital for building smart cities. Most acoustic scene classification studies train convolutional neural networks, but networks built from standard convolutions struggle to capture the temporal-frequency relationships in the audio spectrogram. To overcome this problem, this work proposes a neural network model with a bottleneck structure for acoustic scene classification, based on causal convolution and sub-spectral normalization. To enlarge the training data, a new mix-SpecAugment data augmentation method is proposed, drawing on the existing classical mixup and SpecAugment methods, and the three are combined for sample augmentation. Experiments show that the proposed bottleneck structure improves the recognition accuracy and robustness of the model. In addition, an ablation study shows that training with the combination of mixup, SpecAugment, and the newly proposed mix-SpecAugment achieves the highest classification accuracy. Finally, experiments with audio spectrograms of different shapes show that near-square spectrograms yield better training results. Compared with the baseline, the proposed model significantly improves accuracy and reduces loss.
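The abstract describes combining mixup and SpecAugment-style masking into a single augmentation. The paper does not give implementation details here, so the following is only a minimal NumPy sketch of one plausible combination: mixup first interpolates two spectrograms and their labels, masking then zeroes a random frequency band and time span. The function names, mask sizes, and the mix-then-mask ordering are all assumptions for illustration, not the authors' exact method.

```python
import numpy as np

def mixup(spec_a, spec_b, label_a, label_b, alpha=0.2, rng=None):
    # Classic mixup: convex combination of two spectrograms and their labels.
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * spec_a + (1 - lam) * spec_b, lam * label_a + (1 - lam) * label_b

def spec_augment(spec, max_freq_mask=8, max_time_mask=16, rng=None):
    # SpecAugment-style masking: zero one random frequency band and one time span.
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    f = rng.integers(0, max_freq_mask + 1)          # frequency-mask width
    f0 = rng.integers(0, n_freq - f + 1)            # frequency-mask start bin
    out[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_time_mask + 1)          # time-mask width
    t0 = rng.integers(0, n_time - t + 1)            # time-mask start frame
    out[:, t0:t0 + t] = 0.0
    return out

def mix_spec_augment(spec_a, spec_b, label_a, label_b, rng=None):
    # Hypothetical combined augmentation (assumed ordering): mixup, then mask.
    if rng is None:
        rng = np.random.default_rng()
    mixed, label = mixup(spec_a, spec_b, label_a, label_b, rng=rng)
    return spec_augment(mixed, rng=rng), label
```

Each call produces a new augmented sample with a soft (interpolated) label, which is how all three augmentations could be combined in one training pipeline.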