Abstract

Automatic detection of violent videos has broad application prospects in fields such as video surveillance and movie rating. However, most existing violent video detection models based on multimodal feature fusion ignore the fact that the audio and visual data in the same violent video may not semantically correspond. Blindly fusing non-corresponding features is not only unhelpful but potentially harmful to such models. In this paper, we propose a novel violent video detection model based on semantic correspondence between the audio-visual data of the same video. Deep neural networks are used to extract features of three modalities: appearance, motion, and audio. We then adopt a feature-level fusion strategy, fusing these multimodal features via shared subspace learning; semantic correspondence guides this process through multitask learning and semantic embedding learning. To evaluate the effectiveness of our model, we conduct experiments on several public datasets and on our self-built Violence Correspondence Detection (VCD) dataset. The results show that our model achieves competitive performance on both.
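
As an illustration of the fusion strategy described in the abstract, the following is a minimal PyTorch-style sketch, not the authors' implementation: it projects appearance, motion, and audio features into a shared subspace and trains a violence classifier jointly with an auxiliary audio-visual correspondence classifier (multitask learning). All dimensions, layer choices, and the loss weight are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class FusionWithCorrespondence(nn.Module):
    """Sketch: shared-subspace fusion of appearance, motion, and audio
    features with an auxiliary audio-visual correspondence task.
    Feature dimensions and layer sizes are placeholders, not the paper's."""

    def __init__(self, d_app=2048, d_mot=1024, d_aud=128, d_shared=256):
        super().__init__()
        # One projection per modality into the shared subspace.
        self.proj_app = nn.Linear(d_app, d_shared)
        self.proj_mot = nn.Linear(d_mot, d_shared)
        self.proj_aud = nn.Linear(d_aud, d_shared)
        # Main task: violent vs. non-violent.
        self.violence_head = nn.Linear(3 * d_shared, 2)
        # Auxiliary task: do the audio and visual streams correspond?
        self.corr_head = nn.Linear(3 * d_shared, 2)

    def forward(self, f_app, f_mot, f_aud):
        # Feature-level fusion: concatenate the projected modalities.
        z = torch.cat([
            torch.relu(self.proj_app(f_app)),
            torch.relu(self.proj_mot(f_mot)),
            torch.relu(self.proj_aud(f_aud)),
        ], dim=-1)
        return self.violence_head(z), self.corr_head(z)

# Multitask objective: correspondence supervision guides the fused representation.
model = FusionWithCorrespondence()
ce = nn.CrossEntropyLoss()
f_app, f_mot, f_aud = torch.randn(8, 2048), torch.randn(8, 1024), torch.randn(8, 128)
y_violence, y_corr = torch.randint(0, 2, (8,)), torch.randint(0, 2, (8,))
logits_v, logits_c = model(f_app, f_mot, f_aud)
loss = ce(logits_v, y_violence) + 0.5 * ce(logits_c, y_corr)  # 0.5 is an assumed weight
loss.backward()
```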

Highlights

  • Violent videos do harm to the harmony and stability of society, and greatly jeopardize the physical and mental health of teenagers

  • The experimental results show that after semantic correspondence information is added, the performance of our model is further improved on both the public Violent Scene Detection 2015 dataset (VSD2015) [11] and our self-built Violence Correspondence Detection dataset (VCD)

  • We propose a violent video detection model based on semantic correspondence

Summary

INTRODUCTION

Violent videos do harm to the harmony and stability of society, and greatly jeopardize the physical and mental health of teenagers. Detecting them automatically requires fusing heterogeneous appearance, motion, and audio features. The mainstream method for dealing with this heterogeneity is shared subspace learning, which aims to embed data of different modalities into an intermediate common space in which the heterogeneity can be regarded as eliminated [10]. In this process, the model may implicitly learn some relevant knowledge about the correspondence between audio-visual data. The main contribution of this paper is a violent video detection model based on semantic correspondence. The experimental results show that, after semantic correspondence information is added, the performance of our model is further improved on both the public Violent Scene Detection 2015 dataset (VSD2015) [11] and our self-built Violence Correspondence Detection dataset (VCD).
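
For concreteness, one common way to realize the semantic embedding learning mentioned in the abstract is a margin-based loss that pulls the shared-space projections of corresponding audio and visual features together and pushes non-corresponding pairs apart. The sketch below assumes this formulation; the paper's exact loss and margin may differ.

```python
import torch
import torch.nn.functional as F

def correspondence_embedding_loss(z_vis, z_aud, corresponds, margin=1.0):
    """Sketch of semantic embedding learning in the shared subspace.

    z_vis, z_aud : (batch, d) visual and audio embeddings after projection.
    corresponds  : (batch,) 1.0 if the audio-visual pair semantically
                   corresponds, 0.0 otherwise. The margin is an assumption.
    """
    dist = F.pairwise_distance(z_vis, z_aud)                 # distance per pair
    pos = corresponds * dist.pow(2)                          # corresponding: pull together
    neg = (1 - corresponds) * F.relu(margin - dist).pow(2)   # others: push past the margin
    return (pos + neg).mean()

# Example usage with random embeddings.
z_vis, z_aud = torch.randn(8, 256), torch.randn(8, 256)
corresponds = torch.randint(0, 2, (8,)).float()
loss = correspondence_embedding_loss(z_vis, z_aud, corresponds)
```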

RELATED WORKS
FEATURE FUSION BASELINE
EXPERIMENTS
CONCLUSION AND FUTURE WORK