Abstract

Although it is not immediately intuitive that Deep Convolutional Neural Networks (DCNNs) can yield adequate feature representations for the Foreground Localization (FGL) task, recent architectural and algorithmic advancements in Deep Learning (DL) have shown that DCNNs have become the forefront methodology for this pixel-level classification problem. In FGL, a DCNN faces the inherent challenge of separating moving objects, i.e., the foreground (FG), from non-static background (BG) scenes by learning both local- and global-level features. Driven by the recent success of innovative architectures for image classification and semantic segmentation, this work introduces a novel architecture, called Slow Encoder-Decoder (sEnDec), that aims to improve the learning capacity of a traditional image-to-image DCNN. The proposed model comprises two subnets for contraction (encoding) and expansion (decoding); in both phases it employs intermediate feature-map up-sampling and residual connections. In this way, structural details lost to spatial subsampling are recovered, yielding a more sharply delineated FG region. The experimental study is carried out with two variants of the proposed model: one using strided convolution (conv) and the other using max pooling for spatial subsampling. A comparative analysis on sixteen benchmark video sequences, covering baseline, dynamic background, camera jitter, shadow effects, intermittent object motion, night videos, and bad weather, shows that the proposed sEnDec model performs very competitively against prior and state-of-the-art approaches.
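
To make the architectural description above concrete, the following is a minimal PyTorch sketch of an image-to-image encoder-decoder in the spirit of the abstract: a contraction (encoding) path, an expansion (decoding) path, intermediate up-sampling back to the skipped resolutions, and a switch between strided convolution and max pooling for spatial subsampling. The channel widths, block depths, the class and helper names (EncDecSketch, conv_block), the sigmoid FG-probability head, and the use of skip concatenation as the realization of the residual connections are assumptions made for illustration, not the authors' exact sEnDec configuration.

```python
# Minimal sketch of an image-to-image encoder-decoder for FG/BG segmentation:
# contraction (encoding), expansion (decoding), skip connections, intermediate
# up-sampling, and a choice between strided convolution and max pooling for
# spatial subsampling. Widths and depths are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by BatchNorm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )


class EncDecSketch(nn.Module):
    """Hypothetical encoder-decoder; not the authors' exact sEnDec design."""

    def __init__(self, subsample="strided"):          # "strided" or "maxpool" variant
        super().__init__()
        chs = (3, 32, 64, 128)                        # assumed channel widths
        self.enc = nn.ModuleList(conv_block(chs[i], chs[i + 1]) for i in range(3))
        if subsample == "strided":
            self.down = nn.ModuleList(
                nn.Conv2d(c, c, 3, stride=2, padding=1) for c in chs[1:])
        else:
            self.down = nn.ModuleList(nn.MaxPool2d(2) for _ in chs[1:])
        # Decoder mirrors the encoder; each stage fuses the up-sampled features
        # with the corresponding encoder features (skip concatenation).
        dec_blocks, prev = [], chs[-1]
        for c in reversed(chs[1:]):
            dec_blocks.append(conv_block(prev + c, c))
            prev = c
        self.dec = nn.ModuleList(dec_blocks)
        self.head = nn.Conv2d(chs[1], 1, 1)           # per-pixel FG/BG logit

    def forward(self, x):
        skips = []
        for enc, down in zip(self.enc, self.down):
            x = enc(x)
            skips.append(x)                           # keep pre-subsampling features
            x = down(x)                               # spatial subsampling
        for dec, skip in zip(self.dec, reversed(skips)):
            # Intermediate up-sampling back to the resolution of the skipped features.
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = dec(torch.cat([x, skip], dim=1))
        return torch.sigmoid(self.head(x))            # FG probability map


if __name__ == "__main__":
    model = EncDecSketch(subsample="maxpool")         # max-pooling variant
    mask = model(torch.randn(1, 3, 240, 320))         # a benchmark-sized RGB frame
    print(mask.shape)                                 # torch.Size([1, 1, 240, 320])
```

Switching `subsample` between "strided" and "maxpool" reproduces, under these assumptions, the two variants compared in the experimental study; the skip connections are what recover the structural detail lost to subsampling.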
