Abstract

In the era of smart cities, the advent of Internet of Things technology has catalyzed the proliferation of multimodal sensor data, presenting new challenges in cross-modal event detection, particularly in audio event detection via textual queries. This paper focuses on the novel task of text-to-audio grounding (TAG), which aims to precisely localize the sound segments within an untrimmed audio recording that correspond to events described in a textual query. This challenging task requires multimodal (acoustic and linguistic) information fusion as well as reasoning about the cross-modal semantic match between the given audio and textual query. Unlike conventional methods, which often overlook the nuanced interactions between and within modalities, we introduce the Cross-modal Graph Interaction (CGI) model. This approach leverages a language graph to model the complex semantic relationships among query words, enhancing the understanding of textual queries. In addition, a cross-modal attention mechanism generates snippet-specific query representations, enabling fine-grained semantic matching between audio segments and textual descriptions. A cross-gating module further refines this process by emphasizing relevant features across modalities and suppressing irrelevant information, optimizing multimodal information fusion. Our comprehensive evaluation on the Audiogrounding benchmark dataset not only demonstrates the CGI model’s superior performance over existing methods, but also underscores the importance of sophisticated multimodal interaction for effective TAG in smart cities.
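To make the interaction concrete, the following is a minimal sketch (in PyTorch) of the cross-modal attention and cross-gating steps described above. It is not the authors' implementation: it assumes pre-computed audio snippet features and query word embeddings from upstream encoders, and the class name CrossModalInteraction, its layer names, and the dimensions are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' code): cross-modal attention
# followed by cross-gating, assuming audio snippet features (B, T, d) and
# query word embeddings (B, L, d) produced by upstream encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalInteraction(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        # Projections for snippet-to-word attention.
        self.w_audio = nn.Linear(d_model, d_model)
        self.w_query = nn.Linear(d_model, d_model)
        # Cross-gating: each modality gates the other.
        self.gate_audio = nn.Linear(d_model, d_model)
        self.gate_query = nn.Linear(d_model, d_model)
        # Per-snippet relevance score for grounding.
        self.scorer = nn.Linear(2 * d_model, 1)

    def forward(self, audio: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # audio: (B, T, d) snippet features; query: (B, L, d) word features.
        # Snippet-specific query representation via cross-modal attention:
        # each audio snippet attends over the query words.
        attn = torch.matmul(self.w_audio(audio), self.w_query(query).transpose(1, 2))
        attn = F.softmax(attn / audio.size(-1) ** 0.5, dim=-1)        # (B, T, L)
        query_per_snippet = torch.matmul(attn, query)                 # (B, T, d)

        # Cross-gating: emphasize mutually relevant features and
        # suppress irrelevant ones in each modality.
        audio_gated = audio * torch.sigmoid(self.gate_query(query_per_snippet))
        query_gated = query_per_snippet * torch.sigmoid(self.gate_audio(audio))

        # Per-snippet grounding score in [0, 1]: how likely the snippet
        # belongs to the event described by the query.
        fused = torch.cat([audio_gated, query_gated], dim=-1)         # (B, T, 2d)
        return torch.sigmoid(self.scorer(fused)).squeeze(-1)          # (B, T)
```

In this sketch, each audio snippet forms its own summary of the query through attention, and the sigmoid gates let each modality suppress features of the other that are irrelevant to the described event, mirroring the fine-grained matching and cross-gating roles described in the abstract.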
