Abstract

Violence detection based on deep learning is a research hotspot in intelligent video surveillance. The pre-trained three-dimensional convolutional network (C3D) has a limited ability to extract spatiotemporal features from video: it achieves only 88.2% accuracy on the UCF-101 dataset, which does not meet the accuracy requirements for detecting violent behavior in videos. This paper therefore proposes a network architecture that fuses the residual Inception module of Inception-ResNet-v2 into C3D. Through adaptive learning of feature weights, the three-dimensional features of violent-behavior videos can be fully exploited and the network's capacity to express features is enhanced. Secondly, because the violence-detection dataset (HockeyFights) is small, which easily leads to overfitting and low generalization ability, the UCF-101 dataset is used for fine-tuning so that the shallow layers of the network can fully extract spatiotemporal features. Finally, quantizing the network parameters with a quantization tool and adjusting the sliding-window parameters during inference effectively improve inference efficiency and real-time performance while maintaining high accuracy. Experiments show that the accuracy of the proposed network on the UCF-101 dataset is 6.1% higher than that of the C3D network and 3.1% higher than that of the R3D network, indicating that the improved structure extracts more spatiotemporal features; the network finally achieves 95.1% accuracy on the HockeyFights test set.
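The sliding-window adjustment mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 16-frame window matches C3D's standard clip length, but the stride value is an assumption — a larger stride means fewer forward passes per video and thus faster inference, at the cost of temporal coverage.

```python
def clip_windows(num_frames, window=16, stride=8):
    """Yield (start, end) frame-index pairs for sliding-window inference.

    window: clip length fed to the 3D CNN (C3D takes 16-frame clips).
    stride: hop between consecutive clips; increasing it reduces the
            number of clips scored per video, improving real-time
            performance (illustrative value, not from the paper).
    """
    starts = range(0, max(num_frames - window + 1, 1), stride)
    return [(s, min(s + window, num_frames)) for s in starts]

# A 32-frame video with stride 8 yields three overlapping 16-frame clips;
# doubling the stride to 16 halves that to two, so the detector runs
# roughly twice as fast per video.
```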

