Abstract

Urban sound event detection allows a robot to automatically preload relevant information so that it can handle a variety of scene-related tasks. To address two limitations, the similarity of timbres across scenes and the constraints that audio collection devices place on scene recognition, this paper proposes a fusion model based on the self-attention mechanism. The model combines a scattering transform with a self-attention module. The scattering transform computes modulation spectrum coefficients of multiple orders through cascades of wavelet convolutions and modulus operators; unlike Mel-scale Frequency Cepstral Coefficients (MFCC), its filters are learnable, which better preserves the semantic features of sound scenes with similar timbres. The Transformer, with its self-attention mechanism, has achieved outstanding results in Natural Language Processing (NLP); this paper uses the self-attention mechanism of its encoder mainly to make the feature granularity consistent and thereby refine the features. In addition, the Focal Loss function is adopted to mitigate the imbalance in the sample distribution. The Google-Command and ESC-50 datasets are used to supplement the scene categories of UrbanSound8K. The parameters of the learnable filters that performed well on UrbanSound8K are retained to fine-tune the model on the other two datasets, which have less data and more target categories. The effect of slice duration on the model is also explored. Experimental results show that the model achieves strong performance across a wide range of acoustic scenes.
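As a sketch of the scattering front-end described above, the snippet below computes multi-order scattering coefficients of an audio clip with the kymatio library; this is an assumed implementation, since the abstract does not name one, and the values of J, Q, and the clip length are illustrative rather than the paper's settings. Note that kymatio's filters are fixed, whereas the paper uses a learnable front-end, so this only illustrates the wavelet-convolution-and-modulus cascade itself.

```python
import torch
from kymatio.torch import Scattering1D  # assumed implementation; not named in the paper

T = 4 * 16000   # a 4-second clip at 16 kHz (illustrative, not the paper's setting)
J, Q = 6, 8     # J: log2 of the averaging scale; Q: wavelets per octave

# Cascade of wavelet convolutions and modulus operators, followed by
# low-pass averaging, yielding first- and second-order coefficients.
scattering = Scattering1D(J=J, shape=T, Q=Q)

x = torch.randn(1, T)    # stand-in for a normalized audio waveform
Sx = scattering(x)       # shape: (batch, n_coefficients, T // 2**J)
print(Sx.shape)
```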
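The self-attention refinement stage can be sketched with PyTorch's built-in Transformer encoder, treating each time frame of the scattering features as a token; the dimensions below are assumptions, since the abstract does not give the architecture.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 128, 4, 2   # illustrative sizes, not the paper's configuration

# One encoder layer = multi-head self-attention + feed-forward sublayers.
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

# Each time frame attends to all others, refining features at a
# consistent granularity across the whole sequence.
frames = torch.randn(1, 100, d_model)    # (batch, time frames, feature dim)
refined = encoder(frames)                # same shape, contextually refined
```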
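Focal Loss down-weights well-classified examples so that training concentrates on hard, under-represented classes. A minimal multi-class sketch follows; gamma = 2 is the common default from the original Focal Loss paper, not necessarily the value used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Multi-class focal loss: FL(p_t) = -(1 - p_t)**gamma * log(p_t)."""
    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        ce = F.cross_entropy(logits, targets, reduction="none")  # -log(p_t) per sample
        pt = torch.exp(-ce)                                      # p_t, the true-class probability
        return ((1.0 - pt) ** self.gamma * ce).mean()

# Usage example: 8 samples, 10 classes.
loss = FocalLoss()(torch.randn(8, 10), torch.randint(0, 10, (8,)))
```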
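The transfer step, keeping the front-end weights trained on UrbanSound8K and fine-tuning on the smaller datasets, might look like the following; the module layout, checkpoint path, and class counts are hypothetical, as the abstract does not specify them.

```python
import torch
import torch.nn as nn

class SceneNet(nn.Module):
    """Hypothetical layout: a learnable front-end plus a classification head."""
    def __init__(self, d_feat: int = 128, n_classes: int = 10):
        super().__init__()
        self.frontend = nn.Conv1d(1, d_feat, kernel_size=64, stride=32)
        self.head = nn.Linear(d_feat, n_classes)

    def forward(self, x):
        h = self.frontend(x).mean(dim=-1)   # pool features over time
        return self.head(h)

model = SceneNet(n_classes=10)              # pretrained on UrbanSound8K (10 classes)
# state = torch.load("urbansound8k_checkpoint.pt")   # hypothetical checkpoint path
# model.load_state_dict(state, strict=False)

for p in model.frontend.parameters():       # keep the transferred filter weights fixed
    p.requires_grad = False

model.head = nn.Linear(128, 50)             # new head, e.g. 50 classes for ESC-50
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```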
