Abstract

In the era of smart cities, the advent of Internet of Things technology has catalyzed the proliferation of multimodal sensor data, presenting new challenges in cross-modal event detection, particularly in audio event detection via textual queries. This paper focuses on the novel task of text-to-audio grounding (TAG), which aims to precisely localize the sound segments within an untrimmed audio recording that correspond to events described in a textual query. This challenging task requires multimodal (acoustic and linguistic) information fusion as well as reasoning about the cross-modal semantic match between the given audio and textual query. Unlike conventional methods, which often overlook the nuanced interactions between and within modalities, we introduce the Cross-modal Graph Interaction (CGI) model. This approach leverages a language graph to model the complex semantic relationships among query words, enhancing the understanding of textual queries. In addition, a cross-modal attention mechanism generates snippet-specific query representations, enabling fine-grained semantic matching between audio segments and textual descriptions. A cross-gating module further refines this process by emphasizing relevant features across modalities and suppressing irrelevant information, optimizing multimodal information fusion. Our comprehensive evaluation on the Audiogrounding benchmark dataset not only demonstrates the CGI model’s superior performance over existing methods, but also underscores the importance of sophisticated multimodal interaction for effective TAG in smart cities.
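To make the interaction concrete, the following is a minimal sketch (in PyTorch) of the cross-modal attention and cross-gating steps described above. It is not the authors' implementation: it assumes pre-computed audio snippet features and query word embeddings from upstream encoders, and the class name CrossModalInteraction, its layer names, and the dimensions are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' code): cross-modal attention
# followed by cross-gating, assuming audio snippet features (B, T, d) and
# query word embeddings (B, L, d) produced by upstream encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalInteraction(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        # Projections for snippet-to-word attention.
        self.w_audio = nn.Linear(d_model, d_model)
        self.w_query = nn.Linear(d_model, d_model)
        # Cross-gating: each modality gates the other.
        self.gate_audio = nn.Linear(d_model, d_model)
        self.gate_query = nn.Linear(d_model, d_model)
        # Per-snippet relevance score for grounding.
        self.scorer = nn.Linear(2 * d_model, 1)

    def forward(self, audio: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # audio: (B, T, d) snippet features; query: (B, L, d) word features.
        # Snippet-specific query representation via cross-modal attention:
        # each audio snippet attends over the query words.
        attn = torch.matmul(self.w_audio(audio), self.w_query(query).transpose(1, 2))
        attn = F.softmax(attn / audio.size(-1) ** 0.5, dim=-1)        # (B, T, L)
        query_per_snippet = torch.matmul(attn, query)                 # (B, T, d)

        # Cross-gating: emphasize mutually relevant features and
        # suppress irrelevant ones in each modality.
        audio_gated = audio * torch.sigmoid(self.gate_query(query_per_snippet))
        query_gated = query_per_snippet * torch.sigmoid(self.gate_audio(audio))

        # Per-snippet grounding score in [0, 1]: how likely the snippet
        # belongs to the event described by the query.
        fused = torch.cat([audio_gated, query_gated], dim=-1)         # (B, T, 2d)
        return torch.sigmoid(self.scorer(fused)).squeeze(-1)          # (B, T)
```

In this sketch, each audio snippet forms its own summary of the query through attention, and the sigmoid gates let each modality suppress features of the other that are irrelevant to the described event, mirroring the fine-grained matching and cross-gating roles described in the abstract.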
