Abstract

Violence recognition is challenging because it must be performed on videos acquired by many surveillance cameras, at any time and place. A practical system should make reliable detections in real time and inform surveillance personnel promptly when violent crimes take place. We therefore focus on efficient violence recognition for real-time, on-device operation, enabling easy expansion into a surveillance system with numerous cameras. In this paper, we propose a novel violence detection pipeline that can be combined with conventional 2-dimensional Convolutional Neural Networks (2D CNNs). In particular, frame-grouping is proposed to give 2D CNNs the ability to learn spatio-temporal representations from videos: a simple processing method that averages the channels of input frames and groups three consecutive channel-averaged frames into one input for the 2D CNN. Furthermore, we present spatial and temporal attention modules that are lightweight yet consistently improve violence recognition performance. The spatial attention module, named Motion Saliency Map (MSM), captures salient regions of feature maps derived from motion boundaries using the difference between consecutive frames. The temporal attention module, called the Temporal Squeeze-and-Excitation (T-SE) block, inherently highlights the time periods that are correlated with a target event. Our proposed pipeline brings significant performance improvements over 2D CNNs followed by Long Short-Term Memory (LSTM) networks, with much less computational complexity than existing 3D-CNN-based methods. In particular, MobileNetV3 and EfficientNet-B0 with our proposed modules achieved state-of-the-art performance on six different violence datasets. Our code is available at https://github.com/ahstarwab/Violence_Detection.

Highlights

  • Reliable automatic surveillance systems attract much interest, since crime situations can occur occasionally, at any time and place

  • Our violence detection pipeline consists of three steps. Based on the observation that people in violent situations usually move more actively, producing stronger pixel differences between consecutive frames than in other situations, we propose an efficient spatial attention module inspired by conventional image-processing techniques such as RGB difference and morphological dilation

  • Our work focuses on computing spatial attention maps derived from the boundaries of moving objects, which are then multiplied with the original frames
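The RGB-difference and dilation idea in the highlights above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function name, kernel size, and threshold are our own illustrative choices:

```python
import numpy as np

def motion_saliency_map(prev_frame, curr_frame, dilate=3, thresh=0.1):
    """Hedged sketch of a motion-saliency attention map.

    Computes the absolute inter-frame difference, thresholds it to keep
    strong motion boundaries, then applies morphological dilation so the
    map covers regions around moving objects. Frames are (H, W) intensity
    arrays in [0, 1]; parameter values are illustrative, not the paper's.
    """
    diff = np.abs(curr_frame - prev_frame)        # pixel difference
    mask = (diff > thresh).astype(np.float32)     # keep salient motion only
    # Morphological dilation via a sliding-window maximum (square kernel).
    pad = dilate // 2
    padded = np.pad(mask, pad, mode="edge")
    H, W = mask.shape
    out = np.zeros_like(mask)
    for dy in range(dilate):
        for dx in range(dilate):
            out = np.maximum(out, padded[dy:dy + H, dx:dx + W])
    return out
```

The resulting map would then be multiplied element-wise with the original frame, e.g. `attended = frame * motion_saliency_map(prev, frame)`, so regions with little motion are suppressed.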


Summary

INTRODUCTION

Reliable automatic surveillance systems attract much interest, since crime situations can occur occasionally, at any time and place. Some works have utilized spatial and temporal attention modules in video action recognition to reduce redundant information over space and time [37]–[41].

FRAME-GROUPING

2D convolution performs cross-correlation on a single multi-channel image by applying 2D kernels to each channel and summing the results across the channel axis. Since it only encodes individual frames, it is incapable of modeling spatio-temporal information in videos. We simply average the channels instead of performing grayscale conversion (a linear combination of the channels with weights wR = 0.30, wG = 0.59, wB = 0.11), since each channel of the input frame Xt is already normalized with specific mean and standard deviation values, and the main purpose of frame-grouping is fast modeling of short-term dynamics to capture spatio-temporal information efficiently, rather than a colorful representation. Here, g2 denotes a single fully connected layer applied along the channel axis.
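The frame-grouping operation described above can be sketched in a few lines of NumPy. This is a hedged illustration under our own assumptions (a clip tensor of shape `(T, H, W, 3)` with `T` divisible by 3); the function name is ours, not the paper's:

```python
import numpy as np

def frame_grouping(frames):
    """Hedged sketch of frame-grouping for a 2D CNN.

    `frames` is a (T, H, W, 3) clip of normalized RGB frames. Each frame's
    channels are averaged (a plain mean, not weighted grayscale conversion),
    and every three consecutive channel-averaged frames are stacked into one
    3-channel input, so a 2D CNN sees short-term motion across its channel
    axis.
    """
    avg = frames.mean(axis=-1)                         # (T, H, W) channel average
    T = avg.shape[0]
    assert T % 3 == 0, "illustrative sketch assumes T divisible by 3"
    # Group every 3 consecutive averaged frames into the channel dimension.
    grouped = avg.reshape(T // 3, 3, *avg.shape[1:])   # (T/3, 3, H, W)
    return grouped
```

Each grouped tensor can then be fed to an unmodified 2D CNN, since it has the same 3-channel layout the network expects from an RGB image.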

EXPERIMENTS
Findings
CONCLUSION
