Abstract

Human perception relies heavily on two primary senses, vision and hearing, which are closely interconnected and complement each other. Consequently, various multimodal learning tasks have emerged, with audio-visual event localization (AVEL) being a prominent example. The objective of AVEL is to identify whether an event is present in each video segment and to predict its category, which is useful in domains such as healthcare monitoring and surveillance. Generally speaking, audio-visual co-learning offers a more comprehensive information landscape than single-modal learning, as it enables a more holistic perception of ambient information and better matches real-world applications. Nevertheless, the inherent heterogeneity of audio and visual data can introduce inconsistencies in event semantics, leading to incorrect predictions. To tackle these challenges, we propose a multi-task hybrid attention network (MHAN) that learns high-quality representations of multimodal data. Specifically, our network incorporates hybrid attention of uni- and parallel cross-modal (HAUC) modules, each consisting of a uni-modal attention block and a parallel cross-modal attention block, which exploit complementary and hidden multimodal information for better representations. Furthermore, we advocate using a uni-modal visual task as auxiliary supervision to enhance the multimodal task through a multi-task learning strategy. Extensive experiments on the AVE dataset demonstrate that our proposed model outperforms state-of-the-art methods.
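
The sketch below illustrates, in PyTorch, the general hybrid uni-/cross-modal attention idea described above: each modality attends to itself and, in parallel, to the other modality, and the two streams are fused with residual connections. It is a minimal illustration under assumed settings; the class and parameter names (HybridAttentionBlock, d_model, n_heads) and the fusion scheme are our own assumptions, not the authors' exact HAUC implementation.

```python
# Minimal sketch of a hybrid uni-/cross-modal attention block.
# Assumes segment-level audio and visual features of equal dimension d_model.
import torch
import torch.nn as nn


class HybridAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Uni-modal self-attention for each modality.
        self.self_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Parallel cross-modal attention: each modality attends to the other.
        self.cross_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, num_segments, d_model)
        a_self, _ = self.self_attn_a(audio, audio, audio)
        v_self, _ = self.self_attn_v(visual, visual, visual)
        # Audio queries attend to visual keys/values, and vice versa, in parallel.
        a_cross, _ = self.cross_attn_a(audio, visual, visual)
        v_cross, _ = self.cross_attn_v(visual, audio, audio)
        # Fuse uni- and cross-modal streams with residual connections.
        audio_out = self.norm_a(audio + a_self + a_cross)
        visual_out = self.norm_v(visual + v_self + v_cross)
        return audio_out, visual_out


if __name__ == "__main__":
    a = torch.randn(2, 10, 256)  # e.g. 10 one-second audio segments
    v = torch.randn(2, 10, 256)  # corresponding visual segment features
    out_a, out_v = HybridAttentionBlock()(a, v)
    print(out_a.shape, out_v.shape)  # torch.Size([2, 10, 256]) for both
```

The refined audio and visual features returned by such a block would then feed segment-level event classifiers, with the auxiliary uni-modal visual task supervising the visual branch under the multi-task learning strategy.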
