Abstract

Moving object detection is an essential, well-studied, but still open problem in computer vision and plays a fundamental role in many applications. Traditional approaches usually reconstruct background images with hand-crafted visual features such as color, texture, and edges. Lacking prior knowledge or semantic information, these methods have difficulty dealing with complicated and rapidly changing scenes. To exploit the temporal structure of pixel-level semantic information, in this paper we propose an end-to-end deep sequence learning architecture for moving object detection. First, the video sequences are fed into a deep convolutional encoder–decoder network that extracts pixel-wise semantic features. Then, to exploit the temporal context, we propose a novel attention convolutional long short-term memory (Attention ConvLSTM) to model pixel-wise changes over time. A spatial transformer network and a conditional random field layer are finally appended to reduce the sensitivity to camera motion and to smooth the foreground boundaries. A multi-task loss is proposed to jointly optimize frame-based classification and temporal prediction in an end-to-end network. Experimental results on CDnet 2014 and LASIESTA show improvements of 12.15% and 16.71% over the state of the art, respectively.
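
As a rough illustration of the temporal-modeling stage, the following is a minimal PyTorch sketch of an attention-gated ConvLSTM cell, in which a learned spatial attention map re-weights the input features before the standard ConvLSTM gating. The module name, channel sizes, and the exact attention formulation are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of an attention-gated ConvLSTM cell (assumed reading of
# "Attention ConvLSTM"; the paper's actual architecture may differ).
import torch
import torch.nn as nn


class AttentionConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        pad = k // 2
        # Spatial attention: one score per pixel from input + previous hidden state.
        self.attn = nn.Conv2d(in_ch + hid_ch, 1, k, padding=pad)
        # Input, forget, output, and candidate gates computed by one convolution,
        # the usual ConvLSTM idiom.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=pad)

    def forward(self, x, h, c):
        a = torch.sigmoid(self.attn(torch.cat([x, h], dim=1)))  # (B, 1, H, W)
        x = x * a  # re-weight input features by the attention map
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


# Usage: step the cell over a short sequence of encoder feature maps.
B, T, C, H, W = 2, 5, 16, 32, 32
cell = AttentionConvLSTMCell(in_ch=C, hid_ch=32)
h = torch.zeros(B, 32, H, W)
c = torch.zeros(B, 32, H, W)
for t in range(T):
    h, c = cell(torch.randn(B, C, H, W), h, c)
print(h.shape)  # torch.Size([2, 32, 32, 32])
```

In a full pipeline, the per-frame hidden state would then pass through the spatial transformer and CRF stages before the multi-task loss is applied.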
