Abstract

The automatic detection of foreground (FG) objects in videos is a demanding area of computer vision, with essential applications in video-based traffic analysis and surveillance. Recent solutions have attempted to exploit deep neural networks (DNNs) for this purpose. In DNNs, learning agents, i.e., features for video FG object segmentation, is nontrivial, unlike in image segmentation. It is a temporally processed decision-making problem, in which the agents involved are the spatial and temporal correlations of the FG objects and the background (BG) of the scene. To handle this, and to overcome the poor delineation of conventional DL models at the borders of FG regions caused by fixed-view receptive field-based learning, this work introduces a Multi-view Receptive Field Encoder-Decoder Convolutional Neural Network called MvRF-CNN. The main contribution of the model is harnessing multiple views of convolutional (conv) kernels with residual feature fusions at early, mid, and late stages in an encoder-decoder (EnDec) architecture. This enhances the model's ability to learn condition-invariant agents, resulting in more sharply delineated FG masks than existing approaches, from heuristic- to DL-based techniques. The model is trained with sequence-specific labeled samples to predict scene-specific pixel-level labels of FG objects in near-static scenes with minor dynamism. An experimental study on 37 video sequences from traffic and surveillance scenarios covering complex environments, viz. dynamic background, camera jitter, intermittent object motion, cast shadows, night videos, and bad weather, proves the effectiveness of the model. The study covers two input configurations: a 3-channel (RGB) single frame and a 3-channel double frame with a BG, in which two consecutive grayscale frames are stacked with a prior BG model. Ablation investigations are also conducted to show the importance of transfer learning (TL) and mid-fusion approaches for enhancing segmentation performance, as well as the model's robustness under failure modes: a lack of manually annotated hard ground truths (HGT) and testing on non-scene-specific videos. Overall, the model achieves a mean figure-of-merit of 95% and an average speed of 42 FPS.
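To make the two ideas named above concrete, the following is a minimal sketch, assuming PyTorch-style code, of (a) a multi-view receptive-field block that fuses parallel conv branches with a residual connection and (b) the stacking of two consecutive grayscale frames with a prior BG model into a 3-channel input. The kernel sizes, channel widths, and fusion layout are illustrative assumptions, not the paper's exact design.

    # Hypothetical sketch: multi-view receptive-field block with residual fusion,
    # plus the double-frame (grayscale + BG model) input stacking. Details are assumed.
    import torch
    import torch.nn as nn

    class MultiViewBlock(nn.Module):
        """Parallel conv branches with different receptive fields, fused by a 1x1 conv
        and added back to the input (residual feature fusion)."""
        def __init__(self, channels, kernel_sizes=(3, 5, 7)):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes
            )
            self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)
            self.act = nn.ReLU(inplace=True)

        def forward(self, x):
            views = torch.cat([branch(x) for branch in self.branches], dim=1)
            return self.act(x + self.fuse(views))  # residual fusion of the multiple views

    def stack_double_frame(prev_gray, curr_gray, bg_model):
        """Build the 3-channel double-frame input from two consecutive grayscale
        frames and a prior BG model, each of shape (H, W)."""
        return torch.stack([prev_gray, curr_gray, bg_model], dim=0)  # (3, H, W)

    if __name__ == "__main__":
        x = torch.randn(1, 16, 120, 160)       # a 16-channel feature map
        print(MultiViewBlock(16)(x).shape)     # torch.Size([1, 16, 120, 160])

Fusing parallel branches of different kernel sizes keeps the feature-map resolution while enlarging the effective receptive field, which is one plausible way to realize the "multiple views" of conv kernels that the abstract credits for sharper FG-mask borders.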
