Abstract

Automatic violence detection has received sustained attention owing to its broad range of applications. However, most previous work favors building a generalized pipeline while overlooking the complexity and diversity of violent scenes. In most cases, people judge violence through a variety of sub-concepts, such as blood, fighting, screams, and explosions, which may exhibit certain co-occurrence trends. Therefore, we argue that parsing abstract violence into specific semantics helps to obtain the essential representation of violence. In this paper, we propose a semantic multimodal violence detection framework based on local-to-global embedding. The local semantic detection branch is designed to capture fine-grained violent elements in the video via a set of local semantic detectors, which are generated from a variety of external word embeddings. In addition, we introduce a global semantic alignment branch to mitigate the intra-class variance of violence, in which violent video embeddings are guided to form a compact cluster while maintaining a semantic gap from non-violent embeddings. Furthermore, we construct a multimodal cross-fusion network (MCN) for multimodal feature fusion, which consists of a cross-adaptive module and a cross-perceptual module. The former aims to eliminate inter-modal heterogeneity, while the latter suppresses task-irrelevant redundancies to obtain robust video representations. Extensive experiments demonstrate the effectiveness of the proposed method, which exhibits strong generalization capability and achieves competitive performance on five violence datasets.
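To make the local-to-global design concrete, the following is a minimal PyTorch sketch of the three components named in the abstract. All class, module, and parameter names (LocalSemanticDetectors, GlobalAlignmentHead, MCNFusion, feat_dim, and so on) are illustrative assumptions, not the authors' implementation; the losses that drive the global alignment and the exact cross-adaptive/cross-perceptual designs are omitted.

```python
# Hypothetical sketch of the local-to-global pipeline; names are illustrative,
# not taken from the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSemanticDetectors(nn.Module):
    """Scores each clip against K violence sub-concepts (blood, fighting, screams, ...)
    whose detectors are initialized from external word embeddings."""
    def __init__(self, feat_dim, concept_embeddings):  # concept_embeddings: (K, D)
        super().__init__()
        self.proj = nn.Linear(feat_dim, concept_embeddings.size(1))
        self.concepts = nn.Parameter(concept_embeddings.clone())  # (K, D)

    def forward(self, x):                       # x: (B, T, feat_dim) clip features
        z = F.normalize(self.proj(x), dim=-1)   # (B, T, D)
        c = F.normalize(self.concepts, dim=-1)  # (K, D)
        return z @ c.t()                        # (B, T, K) sub-concept scores

class GlobalAlignmentHead(nn.Module):
    """Pools clip features into a video-level embedding; a contrastive-style loss
    (not shown) would pull violent embeddings into a compact cluster and push
    them away from non-violent ones."""
    def __init__(self, feat_dim, embed_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_dim)

    def forward(self, x):                       # x: (B, T, feat_dim)
        return F.normalize(self.fc(x.mean(dim=1)), dim=-1)  # (B, embed_dim)

class MCNFusion(nn.Module):
    """Toy stand-in for the multimodal cross-fusion network: cross-attention
    between visual and audio streams followed by a gated fusion."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, vis, aud):                # (B, T, dim) each
        attended, _ = self.cross(vis, aud, aud)           # visual queries attend to audio
        g = torch.sigmoid(self.gate(torch.cat([vis, attended], dim=-1)))
        return g * vis + (1 - g) * attended               # (B, T, dim) fused features
```

In this reading, the sub-concept scores provide the fine-grained local semantics, the pooled embedding is what the global alignment branch would regularize, and the fused features feed the final violence classifier.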
