An automatic fine-grained violence detection system for animation based on modified faster R-CNN

Yixin Tang, Yu Chen, Sagar A.S.M Sharifuzzaman, Tie Li

doi:10.1016/j.eswa.2023.121691

Abstract

Animated dramas and movies are important sources of entertainment for children and young people. However, many of these videos depict violence, fighting, abuse, and car accidents, which are unsuitable for young audiences. To prevent children from viewing such content, government bodies in different countries impose age restrictions on these videos. Violence detection is still mostly done by manual inspection, which is time- and resource-consuming and does not achieve very good results. Artificial intelligence techniques such as deep learning and machine learning can instead be used to detect violence in a video. Bleeding is one of the distinct characteristics of violent actions presented in such videos, and this paper proposes a system that detects blood in a given video in real time. Researchers have proposed various image processing-based techniques to detect violence, but these systems lack precision and real-time detection capability. To this end, a deep learning-based approach is proposed to detect violence in given videos and images. A Faster R-CNN model is modified to handle the sophisticated characteristics of violence in cartoon and animation images and to achieve the highest performance. The backbone of the proposed violence detection model is replaced with a modified RegNet model that extracts features from each frame. The standard inner lateral connection is replaced with a modulated deformable convolutional (MDC) layer to extract deformable feature maps. A novel distributed attention module (DAM) is added to the feature pyramid network to improve feature extraction. Multiscale region of interest (ROI) Align is adapted to improve violence detection performance across different scenarios. Moreover, a classification approach is integrated with the detection model to classify the level of violence in a given frame.
We explored state-of-the-art object detection methods, including Faster R-CNN, Cascade R-CNN, YOLOv3-SPP, SSD, FCOS, and YOLOv5, to detect violence in individual video frames. We compared the performance of the different models and found that the modified Faster R-CNN is suitable for real-time, high-accuracy blood detection in a given video or image. This study makes a significant contribution to detecting violence in animation videos, which will help platforms and government bodies regulate entertainment content.

Full Text