Abstract

This paper proposes a modular deep neural network architecture for real-time video analytics in edge-computing environments. The architecture consists of two serially connected networks, a Front-CNN (Convolutional Neural Network) and a Back-CNN, where we adopt a Shallow 3D CNN (S3D) as the Front-CNN and a pre-trained 2D CNN as the Back-CNN. The S3D (i.e., the Front-CNN) condenses a sequence of video frames into a feature map with three channels: it takes a set of sequential frames from a video shot as input and yields a learned 3-channel feature map (3CFM) as output. Since the 3CFM is compatible with the three-channel RGB color image format, the output of the S3D can serve as the input to the pre-trained 2D CNN of the Back-CNN for transfer learning. This serial Front-CNN/Back-CNN architecture is end-to-end trainable and learns both the spatial and the temporal information of videos. Experimental results on the public UCF-Crime and UR Fall Detection datasets show that the proposed S3D-2DCNN model outperforms existing methods and achieves state-of-the-art performance. Moreover, since the Front-CNN and Back-CNN modules are a shallow S3D and a lightweight 2D CNN, respectively, the model is suitable for real-time video recognition in edge-computing environments. We have implemented our CNN model on an NVIDIA Jetson Nano Developer Kit as an edge-computing device to demonstrate its real-time execution.
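The following is a minimal PyTorch sketch of the Front-CNN/Back-CNN pipeline described above. The exact S3D depth, kernel sizes, the temporal-pooling scheme, and the choice of MobileNetV2 as the lightweight pre-trained Back-CNN are assumptions for illustration, not the paper's reported configuration.

```python
# Sketch of the S3D-2DCNN serial architecture (assumed layer sizes).
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class S3DFront(nn.Module):
    """Shallow 3D CNN: condenses a T-frame clip into a 3-channel feature map (3CFM)."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 3, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Collapse the temporal axis so the output matches an RGB image layout.
        self.temporal_pool = nn.AdaptiveAvgPool3d((1, None, None))

    def forward(self, x):            # x: (B, C, T, H, W)
        x = self.conv3d(x)           # (B, 3, T, H, W)
        x = self.temporal_pool(x)    # (B, 3, 1, H, W)
        return x.squeeze(2)          # 3CFM: (B, 3, H, W), RGB-compatible

class S3D2DCNN(nn.Module):
    """Serial Front-CNN -> Back-CNN model; end-to-end trainable."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.front = S3DFront()
        backbone = mobilenet_v2(weights="IMAGENET1K_V1")  # pre-trained Back-CNN
        backbone.classifier[1] = nn.Linear(backbone.last_channel, num_classes)
        self.back = backbone

    def forward(self, clip):         # clip: (B, 3, T, H, W)
        return self.back(self.front(clip))

model = S3D2DCNN(num_classes=2)
logits = model(torch.randn(1, 3, 16, 224, 224))  # one 16-frame RGB clip
print(logits.shape)                              # torch.Size([1, 2])
```

Because the 3CFM has the same shape as an RGB image, gradients flow through the Back-CNN into the Front-CNN, which is what makes the serial connection end-to-end trainable.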

Highlights

  • Surveillance cameras have been increasingly deployed in public places for the purpose of monitoring abnormal events such as criminal activities and medical emergencies [1], [2].

  • Traditional anomaly detection mainly relied on the motion information between two consecutive frames, extracted by optical flow [4] or a dynamic Bayesian network (DBN) [5].

  • The Convolutional 3D (C3D) network learns temporal motion as well as spatial features from video frames. This requires the C3D to execute complex 3D convolutions with kernels in ℝ^{c×d×d×T}, where c is the number of channels, d is the spatial size (i.e., d × d) of the filter, and T is the number of frames in the video clip; a short parameter-count check follows this list.
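The snippet below illustrates the dimensionality point in the last highlight: a C3D-style kernel carries a factor of T more weights than a 2D kernel over the same channels and spatial extent. The values of c, d, and T are illustrative only; PyTorch orders the 3D kernel as (out_channels, c, T, d, d).

```python
# Parameter-count comparison of a 3D (C3D-style) vs. 2D convolution (illustrative sizes).
import torch.nn as nn

c, d, T = 3, 3, 16
conv3d = nn.Conv3d(c, 64, kernel_size=(T, d, d))  # kernel in R^{c x d x d x T}
conv2d = nn.Conv2d(c, 64, kernel_size=(d, d))

p3d = sum(p.numel() for p in conv3d.parameters())
p2d = sum(p.numel() for p in conv2d.parameters())
print(p3d, p2d, p3d / p2d)  # roughly a factor of T more weights per filter
```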


Summary

Introduction

Surveillance cameras have been increasingly deployed in public places for the purpose of monitoring abnormal events such as criminal activities and medical emergencies [1], [2]. The C3D learns temporal motion as well as spatial features from video frames. A pre-trained CNN is fine-tuned on 3 grayscale frames subsampled from a video shot; the SG3Is formed from the training videos are used to fine-tune the pre-trained 2D CNN to learn the motion. Many algorithms [4], [5], [13]–[15] have been developed to handle vast amounts of data automatically; these algorithms can be used for video recognition on a cloud server. Violence detection [16] was performed by transmitting video data obtained from a drone camera to a cloud server. By transmitting road video obtained from a camera to a cloud server, the license plate of a vehicle was extracted [17].
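The following is a minimal sketch of forming an SG3I as described above: three grayscale frames subsampled from a shot are stacked as the R, G, and B channels of a single image, so a pre-trained 2D CNN can be fine-tuned on motion cues. The first/middle/last subsampling rule is an assumption for illustration, not necessarily the cited method's rule.

```python
# SG3I construction sketch: stack 3 subsampled grayscale frames into one 3-channel image.
import numpy as np

def make_sg3i(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W) grayscale video shot -> (H, W, 3) stacked image."""
    t = len(frames)
    idx = [0, t // 2, t - 1]  # assumed subsampling: first, middle, last frame
    return np.stack([frames[i] for i in idx], axis=-1)

shot = np.random.randint(0, 256, size=(30, 224, 224), dtype=np.uint8)
sg3i = make_sg3i(shot)
print(sg3i.shape)  # (224, 224, 3), compatible with RGB-input CNNs
```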


