Abstract

Environmental noise pollution is a significant concern for public health and well-being. Accurately identifying and classifying environmental noise sources is important for developing effective noise reduction strategies. This paper examines the use of an audio spectrogram transformer (AST) for environmental noise tagging. The AST is a purely attention-based model that takes spectrograms of audio signals as input and computes self-attention without convolutions. In earlier work it was pre-trained on large datasets such as ImageNet and AudioSet and showed higher precision than prior approaches. Although hyperparameter values were reported, the basis for many of them is not clear from the previous literature. Results show that several patch split overlap settings are viable and that more overlap does not significantly improve performance. They also show that 96 frequency bins can be used instead of the default 128, which reduces computation. Further, 30% to 40% of the frequency dimension and 20% to 50% of the time dimension can be masked. The trained model is then tested on a dataset of environmental noise sources collected by SiteHive Hexanodes across Australia and New Zealand, where the AST model identifies the different environmental noises with high accuracy.
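
To make the effect of patch split overlap and frequency bin count on computation concrete, the following is a minimal sketch of how the number of input patches might be counted. It is not taken from the paper: the 16 x 16 patch size, the overlap of 6, and the 1024-frame input length are assumptions borrowed from the commonly used AST configuration, and the function names `patches_per_axis`, `num_patches`, and `max_mask_width` are hypothetical.

```python
def patches_per_axis(size: int, patch: int = 16, overlap: int = 6) -> int:
    """Count patches along one spectrogram axis (no padding assumed).

    `patch` and `overlap` defaults are assumptions, not values from the paper.
    """
    stride = patch - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the patch size")
    if size < patch:
        return 0
    return (size - patch) // stride + 1


def num_patches(freq_bins: int, time_frames: int,
                patch: int = 16, overlap: int = 6) -> int:
    """Total patches, i.e. the transformer sequence length before extra tokens."""
    return (patches_per_axis(freq_bins, patch, overlap)
            * patches_per_axis(time_frames, patch, overlap))


def max_mask_width(size: int, fraction: float) -> int:
    """Largest number of bins or frames masked for a given fraction of an axis."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return int(size * fraction)


if __name__ == "__main__":
    time_frames = 1024  # assumed input length, not stated in the abstract
    for freq_bins in (128, 96):
        for overlap in (0, 6):
            n = num_patches(freq_bins, time_frames, overlap=overlap)
            print(f"bins={freq_bins} overlap={overlap} patches={n}")
    # Mask widths for the ranges mentioned in the abstract
    print(max_mask_width(128, 0.3), max_mask_width(1024, 0.5))
```

Under these assumptions, 128 bins give 12 x 101 = 1212 patches and 96 bins give 9 x 101 = 909, a 25% shorter sequence. Because self-attention cost grows roughly with the square of the sequence length, the saving in attention computation is larger than the saving in patch count.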
