Abstract
Background subtraction is an important task in computer vision. Traditional approaches usually rely on low-level visual features such as color, texture, or edges to build background models. Lacking deep features, they often perform poorly on complex video scenes involving illumination changes, background or camera motion, camouflage effects, and shadows. Recently, deep learning has been shown to perform well at extracting deep features. To improve the robustness of background subtraction, in this paper we propose an end-to-end multi-scale spatio-temporal (MS-ST) method that extracts deep features from video sequences. First, a video clip is fed into a convolutional neural network to extract multi-scale spatial features. Then, to exploit temporal information, we combine temporal sampling operations with ConvLSTM modules to extract multi-scale temporal contextual information. Finally, the segmentation result is generated by fusing the multi-scale spatio-temporal features. Experimental results on the CDnet-2014 and LASIESTA datasets demonstrate the effectiveness and superiority of the proposed method.
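As a rough illustration of this pipeline, the sketch below folds a clip's frames into the batch dimension, extracts features at three spatial scales with a shared 2D CNN, and fuses them into a per-pixel foreground mask. All module names, channel widths, and the number of scales are our assumptions, and a simple temporal average stands in for the paper's temporal-sampling and ConvLSTM modules (a minimal ConvLSTM cell is sketched under Summary below).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSSTSketch(nn.Module):
    """Hypothetical sketch of an MS-ST-style pipeline: shared 2D encoder,
    three spatial scales, temporal aggregation, and multi-scale fusion."""

    def __init__(self, in_ch=3, base_ch=16):
        super().__init__()
        # Shared 2D encoder applied to every frame; each stage halves
        # the spatial resolution, giving features at three scales.
        self.stage1 = self._block(in_ch, base_ch)
        self.stage2 = self._block(base_ch, base_ch * 2)
        self.stage3 = self._block(base_ch * 2, base_ch * 4)
        # Fusion head: concatenated multi-scale features -> 1-channel
        # foreground probability map.
        fused_ch = base_ch + base_ch * 2 + base_ch * 4
        self.head = nn.Conv2d(fused_ch, 1, kernel_size=1)

    @staticmethod
    def _block(cin, cout):
        return nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip):
        # clip: (batch, time, channels, height, width)
        b, t, c, h, w = clip.shape
        frames = clip.reshape(b * t, c, h, w)   # fold time into batch
        f1 = self.stage1(frames)                # scale 1/2
        f2 = self.stage2(f1)                    # scale 1/4
        f3 = self.stage3(f2)                    # scale 1/8
        feats = []
        for f in (f1, f2, f3):
            # Temporal average as a placeholder for the ConvLSTM modules.
            f = f.reshape(b, t, *f.shape[1:]).mean(dim=1)
            feats.append(F.interpolate(f, size=(h, w), mode="bilinear",
                                       align_corners=False))
        fused = torch.cat(feats, dim=1)         # multi-scale fusion
        return torch.sigmoid(self.head(fused))  # foreground mask

# Usage: a 5-frame RGB clip at 240x320 resolution.
mask = MSSTSketch()(torch.randn(2, 5, 3, 240, 320))
print(mask.shape)  # torch.Size([2, 1, 240, 320])
```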
Highlights
Background subtraction is an important task in computer vision and plays a fundamental role in many applications such as autonomous driving [1], object tracking [2], crowd analysis [3], traffic analytics [4], and automated anomaly detection [5] in video surveillance.
Common background subtraction evaluation metrics are used for comparison, including Recall, Precision, Specificity, False Positive Rate (FPR), False Negative Rate (FNR), Percentage of Wrong Classifications (PWC), and F-Measure; a sketch of how these are computed appears after these highlights.
In this paper, we propose a novel background subtraction method that labels the foreground in video sequences automatically.
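The metrics named in the highlights follow the standard CDnet-style definitions over per-pixel confusion counts. The helper below (the function name is ours) shows those formulas directly:

```python
def cdnet_metrics(tp, fp, tn, fn):
    """Change-detection metrics from per-pixel confusion counts,
    following the usual CDnet definitions."""
    recall = tp / (tp + fn)                        # true positive rate
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)                           # false positive rate
    fnr = fn / (tp + fn)                           # false negative rate
    pwc = 100.0 * (fp + fn) / (tp + fp + tn + fn)  # % wrong classifications
    f_measure = 2 * precision * recall / (precision + recall)
    return {"Recall": recall, "Precision": precision,
            "Specificity": specificity, "FPR": fpr, "FNR": fnr,
            "PWC": pwc, "F-Measure": f_measure}

# Example: counts from comparing a predicted mask against ground truth.
print(cdnet_metrics(tp=9_000, fp=500, tn=90_000, fn=500))
```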
Summary
Background subtraction is an important task in computer vision and plays a fundamental role in many applications such as autonomous driving [1], object tracking [2], crowd analysis [3], traffic analytics [4], and automated anomaly detection [5] in video surveillance. Traditional algorithms work well only on specific or simple videos, but yield poor performance when facing sudden illumination changes, hard shadows, camouflage, and so on. Yang et al. [35] proposed a background modeling method that extracts spatio-temporal features using a 2D fully convolutional network. In [37], multi-scale features are effectively extracted by 3D convolution operations in both the spatial and temporal domains, but the method performs poorly when processing intermittent motion. We propose to subtract the background with a novel end-to-end multi-scale spatio-temporal (MS-ST) method that needs neither a complex background model nor conventional hand-crafted features. A 2D CNN and ConvLSTM modules are used to extract deep multi-scale spatial and temporal features from the input video clip; a minimal ConvLSTM cell is sketched below.
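Since PyTorch ships no built-in ConvLSTM, the sketch below gives a minimal cell of the kind such methods pair with a 2D CNN; the gate layout is the standard ConvLSTM formulation, while the channel and kernel sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: an LSTM whose gates are computed by
    convolutions, so the hidden state keeps its spatial structure."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four gates (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state=None):
        # x: (batch, channels, height, width) feature map for one frame
        if state is None:
            zeros = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
            state = (zeros, zeros)
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

# Run the cell over a clip's per-frame feature maps to accumulate
# temporal context at one spatial scale.
cell = ConvLSTMCell(in_ch=32, hid_ch=32)
clip_feats = torch.randn(2, 5, 32, 60, 80)  # (batch, time, C, H, W)
state = None
for t in range(clip_feats.size(1)):
    out, state = cell(clip_feats[:, t], state)
print(out.shape)  # torch.Size([2, 32, 60, 80])
```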