Abstract

Multimodal learning with video and audio has shown significant performance improvements in violence detection. However, the two modalities do not contribute equally: the video modality tends to dominate when determining whether a scene contains violent events. Several recent multimodal violence detection methods do not fully account for the data differences between modalities, which leads to an optimization imbalance during training and ultimately to insufficient performance. To address this issue, we propose a Multimodal Contrastive Learning (MCL) method that makes full use of video and audio information for violence detection. Specifically, to prevent the video modality from dominating model training, we design a multi-encoder framework that performs task-driven feature encoding on video and audio separately. To reduce information loss during multimodal fusion, we introduce a contrastive learning task that captures semantically consistent representations across modalities. Extensive experiments on the XD-Violence dataset show that the proposed MCL improves average precision by 2.34% over the state-of-the-art baseline.
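The sketch below illustrates the general idea of a multi-encoder framework with a contrastive alignment objective, as described in the abstract. It is a minimal illustration only: the encoder architectures, feature dimensions, InfoNCE-style loss, and all hyperparameters are assumptions for demonstration and do not reflect the authors' actual implementation.

```python
# Minimal sketch: separate video/audio encoders, a fused violence classifier,
# and a contrastive (InfoNCE-style) loss that aligns the two modalities.
# All dimensions and the loss weighting are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Task-driven encoder for a single modality (video or audio)."""

    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MultimodalContrastiveModel(nn.Module):
    def __init__(self, video_dim: int = 1024, audio_dim: int = 128,
                 embed_dim: int = 128, temperature: float = 0.07):
        super().__init__()
        self.video_enc = ModalityEncoder(video_dim, embed_dim)
        self.audio_enc = ModalityEncoder(audio_dim, embed_dim)
        # Binary violence score from the fused (concatenated) embeddings.
        self.classifier = nn.Linear(2 * embed_dim, 1)
        self.temperature = temperature

    def contrastive_loss(self, zv: torch.Tensor, za: torch.Tensor) -> torch.Tensor:
        # Symmetric InfoNCE: matching video/audio clips are positives,
        # all other pairs in the batch serve as negatives.
        zv = F.normalize(zv, dim=-1)
        za = F.normalize(za, dim=-1)
        logits = zv @ za.t() / self.temperature
        targets = torch.arange(zv.size(0), device=zv.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    def forward(self, video_feat, audio_feat, labels=None):
        zv = self.video_enc(video_feat)
        za = self.audio_enc(audio_feat)
        scores = self.classifier(torch.cat([zv, za], dim=-1)).squeeze(-1)
        if labels is None:
            return torch.sigmoid(scores)
        cls_loss = F.binary_cross_entropy_with_logits(scores, labels.float())
        return cls_loss + self.contrastive_loss(zv, za)


# Usage with random stand-in features (batch of 8 clips); real inputs would be
# pre-extracted video and audio clip features.
model = MultimodalContrastiveModel()
video = torch.randn(8, 1024)
audio = torch.randn(8, 128)
labels = torch.randint(0, 2, (8,))
loss = model(video, audio, labels)
loss.backward()
```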
