Abstract
Sound source localization in visual scenes aims to associate sounds with the visual objects that produce them. Although great progress has been made in this field, mixed sounds from multiple objects make efficient localization intractable in unconstrained scenarios. In this paper, we propose a novel cross-modal sound source localization network (SSLNet) that processes both audio and visual information. First, to encode spatial and temporal information in the spectrogram, we combine a convolutional neural network (CNN) with a bidirectional long short-term memory network (BiLSTM). To produce the inputs for the different time steps of the BiLSTM, we propose a novel operation, called grouped global average pooling (GGAP), which divides the 3-D spectrogram feature block into several 1-D vectors. Then, a cross-modal channel attention mechanism is introduced to alleviate the discrepancy between the modalities. Finally, to achieve pixel-level localization, we propose a fusion approach that uses cosine and L2 distances to measure the similarity between audio and visual vectors. We conduct extensive experiments on the benchmark and FAIR-Play datasets. The qualitative results demonstrate the effectiveness of the sound source localization. On the benchmark dataset, the quantitative results show that SSLNet improves localization accuracy by 1.2% in consensus intersection over union (cIoU) and 16.4% in area under the curve (AUC). On the FAIR-Play dataset, SSLNet also achieves superior performance, with a Pearson’s correlation coefficient (CC) of 0.779 and a similarity metric (SIM) of 0.650.
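To make two of the operations named in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not give the exact definitions, so several details are assumptions: the function names `grouped_global_average_pooling` and `fused_similarity_map` are hypothetical, GGAP is assumed to split the feature block along the time axis and average each group over frequency and time, and the fusion is assumed to be a weighted average (weight `alpha`) of a rescaled cosine similarity and an L2 distance converted to a similarity.

```python
import numpy as np


def grouped_global_average_pooling(features, num_groups):
    """Assumed GGAP: split a (C, F, T) feature block into `num_groups`
    chunks along the time axis and average each chunk over frequency
    and time, giving one length-C vector per group (one per BiLSTM step).
    """
    groups = np.array_split(features, num_groups, axis=2)
    return np.stack([g.mean(axis=(1, 2)) for g in groups])  # (num_groups, C)


def fused_similarity_map(audio_vec, visual_map, alpha=0.5, eps=1e-8):
    """Assumed fusion: compare a (D,) audio vector with every pixel of a
    (D, H, W) visual feature map using cosine and L2 distances, then
    blend the two scores. Returns an (H, W) map with values in [0, 1].
    """
    # L2-normalize both modalities so the two distances share a scale.
    a = audio_vec / (np.linalg.norm(audio_vec) + eps)
    v = visual_map / (np.linalg.norm(visual_map, axis=0, keepdims=True) + eps)

    # Cosine similarity per pixel, rescaled from [-1, 1] to [0, 1].
    cos_sim = (np.einsum("d,dhw->hw", a, v) + 1.0) / 2.0

    # L2 distance between unit vectors lies in [0, 2]; map it to a similarity.
    l2_dist = np.linalg.norm(v - a[:, None, None], axis=0)
    l2_sim = 1.0 - l2_dist / 2.0

    # Hypothetical weighted fusion of the two similarity maps.
    return alpha * cos_sim + (1.0 - alpha) * l2_sim


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec_features = rng.standard_normal((64, 8, 20))   # (C, F, T)
    steps = grouped_global_average_pooling(spec_features, num_groups=5)
    print(steps.shape)                                 # one vector per group

    audio_vec = rng.standard_normal(64)
    visual_map = rng.standard_normal((64, 14, 14))     # (D, H, W)
    heatmap = fused_similarity_map(audio_vec, visual_map)
    print(heatmap.shape)
```

Under these assumptions, pixels whose visual features point in the same direction as the audio vector score close to 1 and pixels pointing the opposite way score close to 0, so the resulting map can be thresholded or upsampled to obtain a localization mask.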