Sound event localization and detection (SELD) refers to classifying sound categories and localizing their sources with acoustic models applied to the same multichannel audio. Recently, SELD has evolved rapidly by leveraging advanced approaches from other research areas, and the benchmark SELD datasets have become increasingly realistic, now providing simultaneously captured videos. Vibration produces sound, and we usually associate visual objects with the sounds they make, i.e., we hear footsteps from a walking person and a jangle from a ringing bell. It is therefore natural to use multimodal information (image, audio, and text rather than audio alone) to improve sound event detection (SED) accuracy and reduce sound source localization (SSL) errors. In this paper, we propose a contrastive representation-based multimodal acoustic model (CRATI) for SELD, which is designed to learn contrastive audio representations from audio, text, and image in an end-to-end manner. Experiments on the real STARSS23 dataset and the synthesized TAU-NIGENS Spatial Sound Events 2021 dataset both show that our CRATI model can learn more effective audio features when additional constraints minimize the difference between audio and text (SED and SSL annotations in this work). Image input, however, contributes little to SELD performance, as only minor visual changes can be observed between consecutive frames. Compared to the baseline system, our model increases the SED F-score by 11% and decreases the SSL error by 31.02° on the STARSS23 dataset.
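To illustrate the audio-text alignment the abstract describes, the following is a minimal sketch of a CLIP-style symmetric contrastive loss between audio embeddings and text embeddings derived from SED/SSL annotations. The function name, embedding shapes, and temperature value are illustrative assumptions, not the CRATI implementation.

```python
# Illustrative sketch (not the authors' code): a symmetric contrastive loss
# that pulls each audio embedding toward the text embedding of its own
# annotations and pushes it away from the other pairs in the batch.
import torch
import torch.nn.functional as F

def contrastive_audio_text_loss(audio_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, text_emb: (batch, dim) outputs of hypothetical encoders."""
    # L2-normalize so dot products become cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits, scaled by the temperature.
    logits = audio_emb @ text_emb.t() / temperature

    # Matching audio/text pairs lie on the diagonal of the logit matrix.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Symmetric cross-entropy over audio-to-text and text-to-audio directions.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)

if __name__ == "__main__":
    # Random tensors stand in for encoder outputs in this usage example.
    audio = torch.randn(8, 512)  # batch of audio clip embeddings
    text = torch.randn(8, 512)   # embeddings of SED/SSL annotation text
    print(contrastive_audio_text_loss(audio, text).item())
```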