Abstract

Underwater acoustic classification is a challenging task due to complex background noise and complicated sound propagation patterns. How the signals are represented is crucial for the classification task. In this paper, we propose a novel representation learning method for underwater acoustic signals that leverages the masked-modeling-based self-supervised learning paradigm. Specifically, we first modify the Swin Transformer architecture to learn general representations of audio signals, applying random masking to the log-mel spectrogram. The goal of the pretext task is to predict the masked parts of both the log-mel spectrogram and the gammatone spectrogram, so that the model learns not only local and global features but also complementary information across the two representations. For the downstream task, we fine-tune the pre-trained model on labeled datasets. On the DeepShip dataset, which consists of 47 hours and 4 minutes of ship sounds in four categories, our model achieves state-of-the-art performance compared with competitive approaches, obtaining a classification accuracy of 78.03%, which is better than the separable convolution autoencoder (SCAE) and approaches using the constant-Q transform spectrogram. This work demonstrates the potential of masked-modeling-based self-supervised learning for the understanding and interpretation of underwater acoustic signals.
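
As a rough illustration of the pretext task described above, the sketch below (PyTorch) randomly masks spectrogram patches and trains an encoder with two lightweight decoders to reconstruct the masked patches of both the log-mel and gammatone targets. The 75% mask ratio, the zero-filled masking scheme, the patch shapes, and the toy linear encoder/decoder modules are illustrative assumptions standing in for the paper's Swin-Transformer-based configuration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def random_patch_mask(batch_size: int, num_patches: int,
                      mask_ratio: float = 0.75) -> torch.Tensor:
    """Boolean mask of shape (batch_size, num_patches); True marks a masked patch."""
    num_masked = int(num_patches * mask_ratio)         # assumed 75% mask ratio
    noise = torch.rand(batch_size, num_patches)        # random score per patch
    ranks = noise.argsort(dim=1)                       # random permutation of patches
    mask = torch.zeros(batch_size, num_patches, dtype=torch.bool)
    mask.scatter_(1, ranks[:, :num_masked], True)      # mark the first num_masked
    return mask

def pretext_loss(encoder, decoder_mel, decoder_gamma,
                 mel_patches, gamma_patches, mask):
    """MSE reconstruction loss over the masked patches of both spectrogram targets."""
    # Zero out masked input patches (a simple stand-in for learned mask tokens).
    visible = mel_patches.masked_fill(mask.unsqueeze(-1), 0.0)
    latent = encoder(visible)                          # (B, N, D) patch features
    pred_mel = decoder_mel(latent)                     # predict log-mel patches
    pred_gamma = decoder_gamma(latent)                 # predict gammatone patches
    m = mask.float()
    loss_mel = (((pred_mel - mel_patches) ** 2).mean(-1) * m).sum() / m.sum()
    loss_gamma = (((pred_gamma - gamma_patches) ** 2).mean(-1) * m).sum() / m.sum()
    return loss_mel + loss_gamma                       # joint dual-target objective

# Toy usage: linear modules stand in for the Swin-style encoder and the decoders.
B, N, P, D = 8, 64, 256, 128                           # batch, patches, patch dim, latent dim
encoder = nn.Sequential(nn.Linear(P, D), nn.GELU())
decoder_mel, decoder_gamma = nn.Linear(D, P), nn.Linear(D, P)
mel = torch.randn(B, N, P)                             # flattened log-mel patches
gamma = torch.randn(B, N, P)                           # flattened gammatone patches
mask = random_patch_mask(B, N)
loss = pretext_loss(encoder, decoder_mel, decoder_gamma, mel, gamma, mask)
loss.backward()
```

Averaging the squared error only over masked patches, while both spectrograms share one encoder, is what encourages the model to capture complementary information from the two time-frequency representations.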
