Abstract

Single-modality data is theoretically insufficient for constructing a complete set of real-world semantics. As a typical application of multi-modality perception, the audio-visual event localization task aims to match audio and visual components to identify events of interest that occur simultaneously in both modalities. Although several recent methods address this task, they cannot handle the temporal inconsistency between audio and visual streams that is widespread in practical audio-visual scenes. Inspired by the human perceptual system, which automatically filters out event-unrelated information during multi-modality perception, we propose a discriminative cross-modality attention network to simulate this process. Similar to the human mechanism, our network adaptively selects "where", "when", and "which" to attend for audio-visual event localization. In addition, to prevent the network from converging to trivial solutions, we propose a novel eigenvalue-based objective function that trains the whole network to better fuse audio and visual signals, yielding a discriminative and nonlinear multi-modality representation. In this way, even under large temporal inconsistency between the audio and visual sequences, our network can adaptively select event-relevant information for audio-visual event localization. Furthermore, we systematically investigate three subtasks of audio-visual event localization, i.e., temporal localization, weakly-supervised spatial localization, and cross-modality localization. Visualization results also help illustrate how the network works.
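To make the cross-modality attention idea concrete, the sketch below shows a generic audio-guided attention module in PyTorch: audio segment features act as queries that softly select event-relevant visual features across all time steps, which is one standard way such a network could tolerate temporal misalignment between the two streams. This is a minimal illustrative sketch under assumed feature dimensions and layer names; it is not the paper's exact architecture, and the eigenvalue-based objective is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttention(nn.Module):
    """Illustrative audio-guided attention over visual features.

    Layer names and dimensions are assumptions for illustration,
    not the architecture proposed in the paper.
    """

    def __init__(self, audio_dim=128, visual_dim=512, hidden_dim=256):
        super().__init__()
        self.query = nn.Linear(audio_dim, hidden_dim)   # audio  -> queries
        self.key = nn.Linear(visual_dim, hidden_dim)    # visual -> keys
        self.value = nn.Linear(visual_dim, hidden_dim)  # visual -> values

    def forward(self, audio, visual):
        # audio:  (batch, T, audio_dim)  -- one feature per temporal segment
        # visual: (batch, T, visual_dim)
        q = self.query(audio)                            # (B, T, H)
        k = self.key(visual)                             # (B, T, H)
        v = self.value(visual)                           # (B, T, H)
        # Scoring every audio segment against every visual segment lets an
        # audio query gather visual evidence from a *different* time step,
        # which is how attention can absorb temporal inconsistency.
        scores = torch.matmul(q, k.transpose(1, 2))      # (B, T, T)
        weights = F.softmax(scores / k.size(-1) ** 0.5, dim=-1)
        attended = torch.matmul(weights, v)              # (B, T, H)
        return attended, weights


# Usage: fuse 10 one-second segments of audio and visual features.
model = CrossModalAttention()
audio = torch.randn(2, 10, 128)
visual = torch.randn(2, 10, 512)
fused, attn = model(audio, visual)   # attn rows show "when" each query attends
```

Inspecting the `attn` weights is the kind of visualization the abstract alludes to: each row shows which visual segments an audio query deemed event-relevant.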
