Abstract

Violence recognition is challenging because it must be performed on videos acquired by many surveillance cameras, at any time and place. A practical system should make reliable detections in real time and inform surveillance personnel promptly when violent crimes take place. We therefore focus on efficient violence recognition for real-time, on-device operation, enabling easy expansion into a surveillance system with numerous cameras. In this paper, we propose a novel violence detection pipeline that can be combined with conventional 2-dimensional Convolutional Neural Networks (2D CNNs). In particular, frame-grouping is proposed to give 2D CNNs the ability to learn spatio-temporal representations from videos: a simple processing method that averages the channels of input frames and groups three consecutive channel-averaged frames into one input for the 2D CNN. Furthermore, we present spatial and temporal attention modules that are lightweight yet consistently improve violence recognition performance. The spatial attention module, named Motion Saliency Map (MSM), captures salient regions of feature maps derived from motion boundaries using the difference between consecutive frames. The temporal attention module, called the Temporal Squeeze-and-Excitation (T-SE) block, inherently highlights the time periods that are correlated with a target event. Our proposed pipeline brings significant performance improvements over 2D CNNs followed by Long Short-Term Memory (LSTM) networks, with much less computational complexity than existing 3D-CNN-based methods. In particular, MobileNetV3 and EfficientNet-B0 with our proposed modules achieved state-of-the-art performance on six different violence datasets. Our code is available at https://github.com/ahstarwab/Violence_Detection.

Highlights

  • Reliable automatic surveillance systems attract much interest, since crime situations can occur occasionally, at any time and place

  • Our violence detection pipeline consists of three steps. Based on the observation that people in violent situations usually move more actively, producing stronger pixel differences between consecutive frames than in other situations, we propose an efficient spatial attention module inspired by conventional image-processing techniques such as RGB difference and morphological dilation

  • Our work focuses on computing spatial attention maps derived from the boundaries of moving objects, which are then multiplied with the original frames
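The RGB-difference and dilation idea in the highlights above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function name, kernel size, and threshold are our own illustrative choices:

```python
import numpy as np

def motion_saliency_map(prev_frame, curr_frame, dilate=3, thresh=0.1):
    """Hedged sketch of a motion-saliency attention map.

    Computes the absolute inter-frame difference, thresholds it to keep
    strong motion boundaries, then applies morphological dilation so the
    map covers regions around moving objects. Frames are (H, W) intensity
    arrays in [0, 1]; parameter values are illustrative, not the paper's.
    """
    diff = np.abs(curr_frame - prev_frame)        # pixel difference
    mask = (diff > thresh).astype(np.float32)     # keep salient motion only
    # Morphological dilation via a sliding-window maximum (square kernel).
    pad = dilate // 2
    padded = np.pad(mask, pad, mode="edge")
    H, W = mask.shape
    out = np.zeros_like(mask)
    for dy in range(dilate):
        for dx in range(dilate):
            out = np.maximum(out, padded[dy:dy + H, dx:dx + W])
    return out
```

The resulting map would then be multiplied element-wise with the original frame, e.g. `attended = frame * motion_saliency_map(prev, frame)`, so regions with little motion are suppressed.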


Summary

INTRODUCTION

Reliable automatic surveillance systems attract much interest, since crime situations can occur occasionally, at any time and place. Some works have utilized spatial and temporal attention modules in video action recognition to reduce redundant information over space and time [37]–[41].

FRAME-GROUPING

2D convolution performs cross-correlation on a single multi-channel image by applying 2D kernels to each channel and summing the results across the channel axis. Since it only encodes individual frames, it is incapable of modeling spatio-temporal information in videos. We simply average the channels instead of performing grayscale conversion (a linear combination of the channels with weights wR = 0.30, wG = 0.59, wB = 0.11), since each channel of the input frame Xt is already normalized with specific mean and standard deviation values, and the main purpose of frame-grouping is fast modeling of short-term dynamics to capture spatio-temporal information efficiently, rather than a colorful representation. Here, g2 denotes a single fully connected layer applied along the channel axis.
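The frame-grouping operation described above can be sketched in a few lines of NumPy. This is a hedged illustration under our own assumptions (a clip tensor of shape `(T, H, W, 3)` with `T` divisible by 3); the function name is ours, not the paper's:

```python
import numpy as np

def frame_grouping(frames):
    """Hedged sketch of frame-grouping for a 2D CNN.

    `frames` is a (T, H, W, 3) clip of normalized RGB frames. Each frame's
    channels are averaged (a plain mean, not weighted grayscale conversion),
    and every three consecutive channel-averaged frames are stacked into one
    3-channel input, so a 2D CNN sees short-term motion across its channel
    axis.
    """
    avg = frames.mean(axis=-1)                         # (T, H, W) channel average
    T = avg.shape[0]
    assert T % 3 == 0, "illustrative sketch assumes T divisible by 3"
    # Group every 3 consecutive averaged frames into the channel dimension.
    grouped = avg.reshape(T // 3, 3, *avg.shape[1:])   # (T/3, 3, H, W)
    return grouped
```

Each grouped tensor can then be fed to an unmodified 2D CNN, since it has the same 3-channel layout the network expects from an RGB image.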

EXPERIMENTS
Findings
CONCLUSION
