Abstract

Sound source localization in visual scenes aims to associate sounds with the objects that produce them. Although great progress has been made in this field, mixed sounds from multiple objects make efficient localization intractable in unconstrained scenarios. In this paper, we propose a novel cross-modal sound source localization network (SSLNet) that jointly processes audio and visual information. First, to encode spatial and temporal information in the spectrogram, we combine a convolutional neural network (CNN) with a bidirectional long short-term memory (BiLSTM) network. To form the time-step inputs of the BiLSTM, we propose a novel operation, called grouped global average pooling (GGAP), which divides the 3-D spectrogram feature block into several 1-D vectors. Then, a cross-modal channel attention mechanism is introduced to alleviate the discrepancy between the two modalities. Finally, to achieve pixel-level localization, we propose a fusion approach that combines cosine and L2 distances to measure the similarity between audio and visual vectors. We conduct extensive experiments on a benchmark dataset and the FAIR-Play dataset. The qualitative results demonstrate the effectiveness of the sound source localization. On the benchmark dataset, the quantitative results show that SSLNet improves localization accuracy by 1.2% in consensus intersection over union (cIoU) and 16.4% in area under the curve (AUC). On the FAIR-Play dataset, SSLNet also achieves superior performance, with a Pearson’s correlation coefficient (CC) of 0.779 and a similarity metric (SIM) of 0.650.
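The abstract describes two concrete operations: GGAP, which pools the 3-D spectrogram feature block into a sequence of 1-D vectors for the BiLSTM, and a localization map built by fusing cosine and L2 distances between audio and visual embeddings. The following is a minimal PyTorch-style sketch of how such operations could look; the tensor shapes, the grouping along the time axis, the `alpha` mixing weight, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of GGAP and cosine/L2 similarity fusion (assumed shapes and naming).
import torch
import torch.nn.functional as F


def ggap(feat, num_groups):
    """Grouped global average pooling (assumed formulation).

    feat: (B, C, T, Freq) CNN feature block of the spectrogram.
    Returns: (B, num_groups, C) -- one 1-D vector per group, usable as
    the time-step inputs of a BiLSTM.
    """
    b, c, t, f = feat.shape
    assert t % num_groups == 0, "time axis must split evenly into groups"
    # Split the time axis into groups, then average within each group
    # over both the remaining time steps and the frequency axis.
    feat = feat.view(b, c, num_groups, t // num_groups, f)
    return feat.mean(dim=(3, 4)).permute(0, 2, 1)  # (B, num_groups, C)


def fused_similarity(audio_vec, visual_map, alpha=0.5):
    """Fuse cosine similarity and (inverted, normalized) L2 distance.

    audio_vec:  (B, D) global audio embedding.
    visual_map: (B, D, H, W) per-pixel visual embeddings.
    alpha is an assumed mixing weight; the paper's fusion rule may differ.
    Returns: (B, H, W) pixel-level localization map.
    """
    b, d, h, w = visual_map.shape
    v = visual_map.flatten(2)                      # (B, D, H*W)
    a = audio_vec.unsqueeze(2)                     # (B, D, 1)
    cos = F.cosine_similarity(a, v, dim=1)         # (B, H*W), in [-1, 1]
    l2 = torch.norm(a - v, dim=1)                  # (B, H*W)
    l2 = l2 / (l2.max(dim=1, keepdim=True).values + 1e-8)  # scale to [0, 1]
    sim = alpha * cos + (1 - alpha) * (1.0 - l2)   # higher = more likely source
    return sim.view(b, h, w)


if __name__ == "__main__":
    spec_feat = torch.randn(2, 256, 16, 8)         # toy CNN output of a spectrogram
    steps = ggap(spec_feat, num_groups=4)          # (2, 4, 256) BiLSTM time steps
    audio = torch.randn(2, 128)
    vision = torch.randn(2, 128, 14, 14)
    heatmap = fused_similarity(audio, vision)      # (2, 14, 14) localization map
    print(steps.shape, heatmap.shape)
```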
