Abstract

Advances in moving object detection have been driven by the active application of deep learning methods. However, many existing models achieve superior detection accuracy at the cost of high computational complexity and slow inference speed. This has hindered the development of such models for mobile and embedded vision tasks, which must be carried out in a timely fashion on computationally limited platforms. In this paper, we propose a super-fast (inference speed: 154 fps) and lightweight (model size: 1.45 MB) end-to-end 3D separable convolutional neural network with a multi-input multi-output (MIMO) strategy, named “3DS_MM”, for moving object detection. To improve detection accuracy, the proposed model adopts 3D convolution, which is better suited than 2D convolution to extracting both spatial and temporal information from video data. To reduce model size and computational complexity, the standard 3D convolution is decomposed into depthwise and pointwise convolutions. In addition, we propose a MIMO strategy to increase inference speed: the network takes multiple frames as input and outputs detection results for multiple frames. Further, we conduct scene dependent evaluation (SDE) and scene independent evaluation (SIE) on the benchmark CDnet2014 and DAVIS2016 datasets. Compared to state-of-the-art approaches, our proposed method significantly increases the inference speed and reduces the model size, while achieving the highest detection accuracy in the SDE setup and maintaining competitive detection accuracy in the SIE setup.
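To make the size reduction from the depthwise/pointwise decomposition concrete, the sketch below counts the weights in a standard 3D convolution and in its separable counterpart. The layer shape (32 input channels, 64 output channels, a 3x3x3 kernel) and the function names are illustrative assumptions, not values from the paper, and bias terms are ignored.

```python
def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weights in a standard 3D convolution (bias terms ignored)."""
    return c_in * c_out * k_t * k_h * k_w


def separable_conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weights in a depthwise 3D convolution followed by a 1x1x1 pointwise one."""
    depthwise = c_in * k_t * k_h * k_w  # one k_t x k_h x k_w filter per input channel
    pointwise = c_in * c_out            # 1x1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 32 -> 64 channels with a 3x3x3 kernel (not from the paper).
standard = conv3d_params(32, 64, 3, 3, 3)              # 55296 weights
separable = separable_conv3d_params(32, 64, 3, 3, 3)   # 2912 weights
reduction = standard / separable                       # roughly 19x fewer weights

print(standard, separable, round(reduction, 1))
```

For this assumed layer the separable form needs about one nineteenth of the weights. The ratio depends on the channel counts and kernel size, so it should not be read as the overall reduction achieved by 3DS_MM.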

Highlights

  • With the increasing number of network cameras, volume of visual data produced, and number of Internet users, processing large amounts of video data at high speed has become both challenging and crucial

  • We began with the standard 3D convolutional neural networks (CNNs) and a multi-input single-output (MISO) strategy, namely “3D CNN + MISO”

  • The resultant “3D separable CNN + MISO” method has a slightly reduced F-measure, but the inference speed increased from 26 fps to 31 fps
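The difference between a multi-input single-output (MISO) and a multi-input multi-output (MIMO) strategy can be illustrated by counting forward passes over a video. The sketch below is only an assumption about how the two strategies might be scheduled: it takes MISO to slide a window one frame at a time and emit one mask per pass, and MIMO to process non-overlapping clips and emit one mask per input frame. The clip length of 9, the video length of 90, and the function names are hypothetical.

```python
def miso_passes(num_frames, clip_len):
    """Forward passes when each clip of clip_len frames yields ONE mask.

    Assumes a sliding window with stride 1, so each full window
    costs its own forward pass.
    """
    return max(num_frames - clip_len + 1, 0)


def mimo_clips(num_frames, clip_len):
    """Split frame indices into non-overlapping clips.

    Under the MIMO assumption each clip is one forward pass that
    yields one mask per frame in the clip.
    """
    return [list(range(start, min(start + clip_len, num_frames)))
            for start in range(0, num_frames, clip_len)]


num_frames, clip_len = 90, 9  # hypothetical values, not from the paper
miso = miso_passes(num_frames, clip_len)      # 82 passes, 82 masks
mimo = len(mimo_clips(num_frames, clip_len))  # 10 passes, 90 masks

print(miso, mimo)
```

Under these assumptions the MIMO schedule covers every frame with far fewer forward passes, which is the intuition behind using it to raise inference speed.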


Summary

Introduction

With the increasing number of network cameras, volume of visual data produced, and number of Internet users, processing large amounts of video data at high speed has become both challenging and crucial. Moving object detection (MOD) is the process of extracting dynamic foreground content, such as moving vehicles or pedestrians, from video frames while discarding the non-moving background. It plays an essential role in many real-world applications [1], such as intelligent video surveillance [2], medical diagnostics [3], anomaly detection [4], and human tracking and action recognition [5], [6]. Traditional methods [7]–[29] are unsupervised and do not require labeled ground truth for algorithm development. They usually include two steps: background modeling and pixel classification. These traditional methods run into difficulties in complex scenarios, such as videos with illumination changes, shadows, night scenes, and dynamic backgrounds.
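The two steps of a traditional pipeline, background modeling and pixel classification, can be sketched as follows. This is a generic running-average background subtractor written for illustration, not any of the cited methods [7]–[29]. The learning rate `alpha`, the threshold of 30, and the toy frame values are assumptions.

```python
def update_background(background, frame, alpha=0.05):
    """Step 1, background modeling: exponential running average per pixel."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
            for b_row, f_row in zip(background, frame)]


def classify_pixels(background, frame, threshold=30):
    """Step 2, pixel classification: 1 = foreground, 0 = background."""
    return [[1 if abs(f - b) > threshold else 0 for b, f in zip(b_row, f_row)]
            for b_row, f_row in zip(background, frame)]


# Toy 2x3 grayscale frames (hypothetical intensities in 0..255).
background = [[10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]
frame = [[12, 200, 11], [9, 10, 180]]

mask = classify_pixels(background, frame)          # [[0, 1, 0], [0, 0, 1]]
background = update_background(background, frame)  # model drifts toward the frame

print(mask)
```

A single global threshold like this is exactly what struggles with illumination changes, shadows, and dynamic backgrounds, since those also shift pixel intensities away from the background model.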
