Abstract

Most abnormal acoustic event detection (AAED) relies on supervised training of deep learning models, but manually labeled samples are costly and scarce. This work proposes a self-supervised representation learning method for AAED based on contrastive learning to overcome this problem. Auditory and visual data augmentations are applied simultaneously to create positive sample pairs, and an attention mechanism is introduced into the encoder during self-supervised pre-training. Features fused by discriminant correlation analysis are compared with single features to verify the feature-extraction ability of the self-supervised pre-trained model. Pre-training is performed on a noisy abnormal acoustic dataset. Results show that the self-supervised pre-trained model achieves an accuracy of 87.72% under linear evaluation and 88.70% on the downstream task with a small, clean AAED dataset, exceeding the results of supervised learning. This work eases the demand for labeled abnormal acoustic events.
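The contrastive pre-training described above can be illustrated with a minimal sketch, assuming a SimCLR-style NT-Xent objective in PyTorch. The specific augmentations (noise injection as the auditory view, SpecAugment-style masking as the visual view), the attention-pooled encoder, and all names (AttentiveEncoder, nt_xent) are illustrative assumptions, not the authors' implementation; for brevity, both views are generated from the same log-mel spectrogram.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def auditory_augment(spec):
    """'Auditory' view: additive Gaussian noise (illustrative stand-in)."""
    return spec + 0.01 * torch.randn_like(spec)


def visual_augment(spec):
    """'Visual' view: mask a random frequency band and time span (SpecAugment-style)."""
    out = spec.clone()
    f0 = torch.randint(0, out.size(-2) - 8, (1,)).item()
    out[..., f0:f0 + 8, :] = 0.0          # frequency mask
    t0 = torch.randint(0, out.size(-1) - 8, (1,)).item()
    out[..., t0:t0 + 8] = 0.0             # time mask
    return out


class AttentiveEncoder(nn.Module):
    """Small CNN encoder with attention pooling, followed by a projection head."""

    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = nn.Linear(64, 1)       # scores each time-frequency cell
        self.proj = nn.Sequential(nn.Linear(64, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                  # x: (B, 1, n_mels, T)
        h = self.conv(x)                   # (B, 64, F', T')
        h = h.flatten(2).transpose(1, 2)   # (B, F'*T', 64)
        w = torch.softmax(self.attn(h), dim=1)
        z = (w * h).sum(dim=1)             # attention-pooled embedding
        return F.normalize(self.proj(z), dim=-1)


def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss: matched views are positives, all other samples are negatives."""
    z = torch.cat([z1, z2], dim=0)         # (2B, D), rows are L2-normalized
    sim = z @ z.t() / tau                  # cosine similarities
    sim.fill_diagonal_(float("-inf"))      # exclude self-similarity
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    enc = AttentiveEncoder()
    opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
    specs = torch.randn(8, 1, 64, 128)     # stand-in batch of log-mel spectrograms
    loss = nt_xent(enc(auditory_augment(specs)), enc(visual_augment(specs)))
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"contrastive loss: {loss.item():.4f}")
```

After pre-training, the projection head would typically be discarded and the encoder evaluated either frozen with a linear classifier (the 87.72% linear-evaluation setting) or fine-tuned on the small labeled AAED set (the 88.70% downstream setting).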
