Abstract

Semantically labeling every pixel in a video is a challenging task, as video is an information-intensive medium with complex spatio-temporal dependencies. In this paper we present a novel deep convolutional network architecture, called deep spatio-temporal fully convolutional networks (DST-FCN), which leverages both spatial and temporal dependencies among pixels and voxels and is trained in an end-to-end manner. Specifically, we introduce a two-stream network for learning deep spatio-temporal dependencies, in which a 2D FCN followed by a convolutional long short-term memory (ConvLSTM) operates at the pixel level and a 3D FCN operates at the voxel level. Our model differs from a conventional FCN in two ways: it extends the FCN with a ConvLSTM at the pixel level to capture long-term dependencies, and it introduces a 3D FCN to enable voxel-level prediction. On the two benchmarks of A2D and CamVid, our DST-FCN achieves results superior to state-of-the-art techniques. More remarkably, we obtain the best results reported to date: 45.0% per-label accuracy on A2D and 68.8% mean IoU on CamVid.
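To make the two-stream design concrete, below is a minimal sketch in PyTorch (the framework is an assumption; the paper does not specify one) of a DST-FCN-style model: a per-frame 2D FCN feeding a ConvLSTM for pixel-level prediction, plus a 3D FCN for voxel-level prediction. The layer widths, the particular ConvLSTM cell, the class count, and the averaging fusion of the two streams are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the two-stream DST-FCN idea (assumed PyTorch implementation).
# Layer sizes, the fusion-by-averaging step, and the class count are illustrative
# assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: all four gates from one 2D convolution."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class DSTFCNSketch(nn.Module):
    """Two streams: 2D FCN + ConvLSTM (pixel level) and a 3D FCN (voxel level)."""
    def __init__(self, num_classes=19, feat=32):
        super().__init__()
        # Pixel-level stream: per-frame 2D FCN features ...
        self.fcn2d = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        # ... followed by a ConvLSTM over time for long-term dependencies.
        self.convlstm = ConvLSTMCell(feat, feat)
        self.head2d = nn.Conv2d(feat, num_classes, 1)
        # Voxel-level stream: a 3D FCN over the whole clip.
        self.fcn3d = nn.Sequential(
            nn.Conv3d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat, num_classes, 3, padding=1),
        )

    def forward(self, clip):  # clip: (B, T, 3, H, W)
        b, t, _, hgt, wid = clip.shape
        h = clip.new_zeros(b, self.convlstm.hid_ch, hgt, wid)
        c = torch.zeros_like(h)
        pixel_logits = []
        for step in range(t):
            feats = self.fcn2d(clip[:, step])
            h, c = self.convlstm(feats, (h, c))
            pixel_logits.append(self.head2d(h))
        pixel_logits = torch.stack(pixel_logits, dim=2)  # (B, C, T, H, W)
        voxel_logits = self.fcn3d(clip.transpose(1, 2))  # (B, C, T, H, W)
        return (pixel_logits + voxel_logits) / 2         # late fusion (assumed)


if __name__ == "__main__":
    model = DSTFCNSketch()
    clip = torch.randn(1, 4, 3, 64, 64)  # one 4-frame RGB clip
    print(model(clip).shape)             # torch.Size([1, 19, 4, 64, 64])
```

Training such a model end-to-end with a per-voxel cross-entropy loss would exercise both streams jointly, which is the abstract's stated goal; the real system presumably uses far deeper FCN backbones than this toy sketch.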
