DevsNet: Deep Video Saliency Network using Short-term and Long-term Cues

Yuming Fang, Chi Zhang, Xiongkuo Min, Hanqin Huang, Yugen Yi, Guangtao Zhai, Chia-Wen Lin

doi:10.1016/j.patcog.2020.107294

Abstract

Recently, various deep-learning-based saliency detection methods have been proposed for still images. However, research on saliency detection for video sequences remains limited. In this study, we introduce a novel deep learning framework for video saliency detection, namely the Deep Video Saliency Network (DevsNet). DevsNet consists of two main components: a 3D Convolutional Network (3D-ConvNet) and a Bidirectional Convolutional Long Short-Term Memory Network (B-ConvLSTM). The 3D-ConvNet is constructed to learn short-term spatiotemporal information, while the B-ConvLSTM learns long-term spatiotemporal features. The designed B-ConvLSTM extracts temporal information not only from previous video frames but also from subsequent ones, so the proposed model takes both forward and backward temporal information into account. By combining short-term and long-term spatiotemporal cues, DevsNet can extract saliency information from video sequences effectively and efficiently. Extensive experiments show that the proposed model achieves better spatiotemporal saliency prediction performance than state-of-the-art models.
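To make the short-term/long-term split concrete, the following is a minimal NumPy sketch of the data flow described above: a 3D convolution over a few neighboring frames yields short-term spatiotemporal features, and two ConvLSTMs, one run forward and one run backward in time, aggregate them into long-term features whose concatenation is read out as a per-frame saliency map. It is not DevsNet itself. The helper names (conv2d_same, conv3d, convlstm_step, bconvlstm), the single-layer depth, kernel sizes, channel counts, ReLU, concatenation-based fusion, 1x1 sigmoid readout and random untrained weights are all illustrative assumptions; the ConvLSTM cell uses the common gate formulation without peephole connections.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, w):
    """'Same'-padded 2D cross-correlation.
    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k -> (C_out, H, W)."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    H, W = x.shape[1:]
    out = np.zeros((c_out, H, W))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j): (C_out, C_in) @ (C_in, H*W)
            patch = xp[:, i:i + H, j:j + W].reshape(c_in, -1)
            out += (w[:, :, i, j] @ patch).reshape(c_out, H, W)
    return out

def conv3d(clip, w):
    """Short-term cue (assumed form): 3D conv, 'valid' in time, 'same' in space.
    clip: (C_in, T, H, W); w: (C_out, C_in, kt, k, k) -> (C_out, T-kt+1, H, W)."""
    kt = w.shape[2]
    T = clip.shape[1]
    frames = [sum(conv2d_same(clip[:, t + dt], w[:, :, dt]) for dt in range(kt))
              for t in range(T - kt + 1)]
    return np.stack(frames, axis=1)

def convlstm_step(x, h, c, w):
    """One ConvLSTM step (no peepholes). w: (4*C_h, C_x + C_h, k, k)
    produces input, forget, output gates and the candidate cell state."""
    z = conv2d_same(np.concatenate([x, h], axis=0), w)
    i, f, o, g = np.split(z, 4, axis=0)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def bconvlstm(seq, w_fwd, w_bwd, c_h):
    """Long-term cue (assumed form): one ConvLSTM over the sequence forward,
    another backward; hidden states are concatenated per time step."""
    H, W = seq[0].shape[1:]

    def run(frames, w):
        h, c = np.zeros((c_h, H, W)), np.zeros((c_h, H, W))
        outs = []
        for x in frames:
            h, c = convlstm_step(x, h, c, w)
            outs.append(h)
        return outs

    fwd = run(seq, w_fwd)
    bwd = run(seq[::-1], w_bwd)[::-1]  # re-align backward outputs with time
    return [np.concatenate([hf, hb], axis=0) for hf, hb in zip(fwd, bwd)]

# Hypothetical sizes and random weights, not taken from the paper.
rng = np.random.default_rng(0)
C, T, H, W = 3, 8, 16, 16
kt, C3, Ch, k = 3, 4, 5, 3
video = rng.standard_normal((C, T, H, W))
w3d = rng.standard_normal((C3, C, kt, k, k)) * 0.1
short = conv3d(video, w3d)                                   # (C3, 6, H, W)
seq = [np.maximum(short[:, t], 0) for t in range(short.shape[1])]  # ReLU
w_f = rng.standard_normal((4 * Ch, C3 + Ch, k, k)) * 0.1
w_b = rng.standard_normal((4 * Ch, C3 + Ch, k, k)) * 0.1
long_feats = bconvlstm(seq, w_f, w_b, Ch)                    # each (2*Ch, H, W)
w_out = rng.standard_normal((1, 2 * Ch, 1, 1)) * 0.1
saliency = [sigmoid(conv2d_same(f, w_out))[0] for f in long_feats]
print(len(saliency), saliency[0].shape)  # 6 (16, 16)
```

Because the backward ConvLSTM starts at the last frame, the fused feature at the first time step already depends on every later frame, which is the forward-plus-backward property the abstract attributes to B-ConvLSTM; the forward half at that step sees only the first frame.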
