Abstract

Background subtraction is an important task in computer vision. Traditional approaches usually rely on low-level visual features such as color, texture, or edges to build background models. Lacking deep features, they often perform poorly on complex video scenes involving illumination changes, background or camera motion, camouflage effects, and shadows. Recently, deep learning has been shown to perform well at extracting deep features. To improve the robustness of background subtraction, in this paper we propose an end-to-end multi-scale spatio-temporal (MS-ST) method that extracts deep features from video sequences. First, a video clip is fed into a convolutional neural network to extract multi-scale spatial features. Then, to exploit temporal information, we combine temporal sampling operations with ConvLSTM modules to extract multi-scale temporal contextual information. Finally, the segmentation result is generated by fusing the multi-scale spatio-temporal features. Experimental results on the CDnet-2014 and LASIESTA datasets demonstrate the effectiveness and superiority of the proposed method.
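
As a rough illustration of the pipeline described above, the sketch below assumes PyTorch; the layer widths, the two-scale design, and the fusion head are our own illustrative choices, and a simple temporal mean stands in for the ConvLSTM modules, so this is not the authors' implementation.

```python
# Illustrative MS-ST-style pipeline (NOT the authors' implementation).
# Assumptions: PyTorch, two spatial scales, temporal aggregation by
# averaging over the clip as a stand-in for the ConvLSTM modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSSTSketch(nn.Module):
    def __init__(self, in_ch=3, ch=32):
        super().__init__()
        # Per-frame 2D CNN encoder producing features at two scales.
        self.scale1 = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU())
        self.scale2 = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        # Fusion head turning concatenated multi-scale features into a mask.
        self.head = nn.Conv2d(2 * ch, 1, 1)

    def forward(self, clip):              # clip: (B, T, C, H, W)
        b, t, c, h, w = clip.shape
        frames = clip.reshape(b * t, c, h, w)
        f1 = self.scale1(frames)          # (B*T, ch, H, W)
        f2 = self.scale2(f1)              # (B*T, ch, H/2, W/2)
        # Temporal aggregation: mean over T frames (ConvLSTM stand-in).
        f1 = f1.reshape(b, t, -1, h, w).mean(dim=1)
        f2 = f2.reshape(b, t, -1, h // 2, w // 2).mean(dim=1)
        f2 = F.interpolate(f2, size=(h, w), mode="bilinear", align_corners=False)
        # Fuse multi-scale spatio-temporal features into a foreground mask.
        return torch.sigmoid(self.head(torch.cat([f1, f2], dim=1)))

mask = MSSTSketch()(torch.rand(2, 5, 3, 64, 64))
print(mask.shape)  # torch.Size([2, 1, 64, 64])
```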

Highlights

  • Background subtraction is an important task in the computer vision domain and plays a fundamental role in many applications such as autonomous driving [1], object tracking [2], crowd analysis [3], traffic analytics [4], and automated anomaly detection [5] in video surveillance

  • Standard background subtraction evaluation metrics are used for comparison, including Recall, Precision, Specificity, False Positive Rate (FPR), False Negative Rate (FNR), Percentage of Wrong Classifications (PWC), and F-Measure; their definitions are spelled out in the sketch after this list

  • In this paper, we propose a novel background subtraction method that automatically labels the foreground in video sequences
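
Since the highlights only name these metrics, the snippet below spells out their standard per-pixel definitions in terms of true/false positive and negative counts; the helper's name and interface are illustrative assumptions, not taken from the paper.

```python
# Standard definitions of the listed metrics in terms of per-pixel counts.
# The helper name and interface are illustrative, not from the paper.
def bgs_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    recall = tp / (tp + fn)                        # true positive rate
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)                           # False Positive Rate
    fnr = fn / (tp + fn)                           # False Negative Rate
    pwc = 100.0 * (fn + fp) / (tp + fn + fp + tn)  # Percentage of Wrong Classifications
    f_measure = 2 * precision * recall / (precision + recall)
    return {"Recall": recall, "Precision": precision, "Specificity": specificity,
            "FPR": fpr, "FNR": fnr, "PWC": pwc, "F-Measure": f_measure}

print(bgs_metrics(tp=900, fp=100, tn=8900, fn=100)["F-Measure"])  # 0.9
```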

Summary

INTRODUCTION

Background subtraction is an important task in the computer vision domain and plays a fundamental role in many applications such as autonomous driving [1], object tracking [2], crowd analysis [3], traffic analytics [4], and automated anomaly detection [5] in video surveillance. Traditional background subtraction algorithms work well only on specific or simple videos, but yield poor performance under sudden illumination changes, hard shadows, camouflage, and so on. Yang et al. [35] proposed a background modeling method that extracts spatio-temporal features with a 2D fully convolutional network. Although the 3D-convolution approach of [37] effectively extracts multi-scale features in both the spatial and temporal domains, it performs poorly when processing intermittent motion. We propose a novel end-to-end multi-scale spatio-temporal (MS-ST) method that subtracts the background without a complex background model or conventional hand-crafted features. A 2D CNN and ConvLSTM modules are used to extract deep multi-scale spatial and temporal features from the input video clip.
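
Because ConvLSTM modules carry the temporal context here, the cell below is a minimal, self-contained sketch of a standard ConvLSTM formulation (one convolution computing all four gates, no peephole connections); it is not the authors' exact module, and the channel sizes and names are our own assumptions.

```python
# Minimal ConvLSTM cell (standard formulation; not the paper's exact module).
# All gates are computed by one convolution over [input, hidden] channels.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state=None):      # x: (B, C, H, W)
        b, _, h, w = x.shape
        if state is None:
            zeros = x.new_zeros(b, self.hid_ch, h, w)
            state = (zeros, zeros)         # (hidden, cell)
        h_prev, c_prev = state
        # Input, forget, output, and candidate gates from one convolution.
        i, f, o, g = self.gates(torch.cat([x, h_prev], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h_new = torch.sigmoid(o) * torch.tanh(c)
        return h_new, (h_new, c)

# Run a 5-frame clip of feature maps through the cell, frame by frame.
cell = ConvLSTMCell(in_ch=32, hid_ch=32)
state = None
for frame_feat in torch.rand(5, 2, 32, 64, 64):  # (T, B, C, H, W)
    out, state = cell(frame_feat, state)
print(out.shape)  # torch.Size([2, 32, 64, 64])
```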

RELATED WORK
TEMPORAL FEATURE EXTRACTOR
EXPERIMENTAL ANALYSIS
INTRODUCTION TO DATASETS
INTRODUCTION TO EVALUATION METRICS
RESULTS ON CDnet-2014 DATASET
RESULTS ON LASIESTA DATASET
Findings
CONCLUSION
