Abstract
According to the Wall Street Journal, one billion surveillance cameras will be deployed around the world by 2021. Humans can hardly manage this amount of information. Using an Inflated 3D ConvNet as the backbone, this paper introduces a novel automatic violence detection approach that outperforms existing state-of-the-art proposals. Most of those proposals include a pre-processing step that focuses only on some regions of interest in the scene, i.e., those actually containing a human subject. In this regard, this paper also reports the results of an extensive analysis of whether and how context affects the performance of the adopted classifier. The experiments show that context-free footage yields a substantial deterioration in classifier performance (2% to 5%) on publicly available datasets. However, they also demonstrate that performance stabilizes in context-free settings, regardless of the level of context restriction applied. Finally, a cross-dataset experiment investigates whether results obtained in a single-collection experiment (same dataset used for training and testing) generalize to cross-collection settings (different datasets used for training and testing).
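To make the notion of context restriction more concrete, the following is a minimal sketch of how context-free footage could be produced: every pixel outside the regions containing a human subject is blanked out. It assumes the bounding boxes come from some person detector, which the abstract does not specify; the function name `mask_context`, the box format, and the fill value are hypothetical choices for illustration, not the authors' pipeline.

```python
import numpy as np

def mask_context(frame, boxes, fill=0):
    """Return a copy of `frame` with everything outside `boxes` set to `fill`.

    frame: array of shape (H, W) or (H, W, C).
    boxes: iterable of (x0, y0, x1, y1) pixel boxes with x1/y1 exclusive.
           Assumed to come from a person detector (hypothetical here).
    """
    h, w = frame.shape[:2]
    keep = np.zeros((h, w), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the frame so out-of-range boxes are harmless.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(w, x1), min(h, y1)
        if x1 > x0 and y1 > y0:
            keep[y0:y1, x0:x1] = True
    # Start from a blank frame and copy back only the kept pixels.
    out = np.full_like(frame, fill)
    out[keep] = frame[keep]
    return out

# Toy 6x6 grayscale frame with one 2x2 "person" region.
frame = np.arange(36, dtype=np.uint8).reshape(6, 6)
masked = mask_context(frame, [(1, 1, 3, 3)])
```

Shrinking or enlarging the boxes before masking would give different levels of context restriction, which is the kind of variation the experiments compare.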
Highlights
– Continuous monitoring of visual streams for the timely detection of emergency/anomalous situations is critical for effective intervention whenever two or more persons can interact, especially in public spaces
– We introduce a violence classifier built on top of a pretrained deep neural network that reports highly competitive results in action recognition
– The 3D ConvNet consists of a 2D convolutional neural network that takes grayscale frames as input, with the temporal information as the third dimension
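As a rough illustration of the input layout described in the last highlight, the sketch below converts a short clip to grayscale and stacks the frames so that time becomes the third axis. It assumes RGB frames of equal size and uses the common ITU-R BT.601 luma weights; the function names `to_gray` and `stack_clip` are hypothetical, and the actual pre-processing used by the authors is not given here.

```python
import numpy as np

# ITU-R BT.601 luma weights for R, G, B (an assumed conversion choice).
LUMA = np.array([0.299, 0.587, 0.114], dtype=np.float32)

def to_gray(frame_rgb):
    """Convert an (H, W, 3) RGB frame to an (H, W) grayscale frame."""
    return frame_rgb.astype(np.float32) @ LUMA

def stack_clip(frames_rgb):
    """Stack T RGB frames into an (H, W, T) volume with time as the third axis."""
    gray = [to_gray(f) for f in frames_rgb]
    return np.stack(gray, axis=-1)

# Toy clip: 16 frames of 4x5 pixels, frame t filled with intensity 10 * t.
clip = [np.full((4, 5, 3), 10 * t, dtype=np.uint8) for t in range(16)]
volume = stack_clip(clip)  # shape (4, 5, 16)
```

A convolution applied to such a volume sees motion as variation along the last axis, which is what lets the network pick up temporal cues in addition to appearance.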
Summary
Continuous monitoring of visual streams for the timely detection of emergency/anomalous situations is critical for effective intervention whenever two or more persons can interact, especially in public spaces. Violence detection stems, in a sense, from action recognition but aims solely at recognizing violent actions. On the one hand, it is more general, since it relies on pure binary classification; on the other hand, for the very same reason, it may prove more complex, because it requires training a classifier on a whole class of actions. It is worth clarifying the terms used in what follows.