Abstract

Recently, optical flow guided video saliency detection methods have achieved high performance. However, computing optical flow is usually expensive, which limits the applications of these methods in time-critical scenarios. In this article, we propose an end-to-end cross complementary network (CCNet) based on a fully convolutional network for video saliency detection. The CCNet consists of two effective components: a single-image representation enhancement (SRE) module and a spatiotemporal information learning (STIL) module. The SRE module provides robust saliency feature learning for a single image through a pyramid pooling module followed by a lightweight channel attention module. As an effective alternative to optical flow for extracting spatiotemporal information, the STIL module introduces a spatiotemporal information fusion module and a video correlation filter to learn spatiotemporal information, i.e., the collaborative and interactive cues between consecutive input groups. In addition to enhancing the feature representation of a single image, the combination of SRE and STIL learns the spatiotemporal information and the correlation between consecutive images well. Extensive experimental results demonstrate the effectiveness of our method in comparison with 14 state-of-the-art approaches.
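The abstract describes the SRE module as a pyramid pooling module followed by a lightweight channel attention module, but gives no internals. As a rough illustration only, the following is a minimal NumPy sketch of a squeeze-and-excitation style channel attention of the kind the abstract alludes to; the function name, the two-layer bottleneck, and the weight shapes are all hypothetical and not taken from the paper.

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """Lightweight channel attention sketch: global-average-pool each
    channel, pass the pooled vector through a tiny two-layer MLP, and
    rescale the channels by the resulting sigmoid gates."""
    squeeze = feat.mean(axis=(1, 2))                 # (C,) per-channel pooling
    hidden = np.maximum(w1 @ squeeze, 0.0)           # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid gates in (0, 1)
    return feat * gates[:, None, None]               # reweight each channel

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))   # C=8 channels, 4x4 spatial map
w1 = rng.standard_normal((2, 8)) * 0.1  # bottleneck: reduction ratio 4
w2 = rng.standard_normal((8, 2)) * 0.1
out = channel_attention(feat, w1, w2)
print(out.shape)  # (8, 4, 4)
```

Each channel of the output is the corresponding input channel multiplied by a single learned scalar gate, which is what makes this kind of attention "lightweight": the gating cost is independent of spatial resolution.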

Highlights

  • Video salient object detection (VSOD) aims at finding the most obvious object in each video group

  • Our goal is to design a network with lower computational cost that can effectively extract rich spatiotemporal information, together with low-level and high-level features, to generate a group of high-quality pixelwise salient object maps

  • Inspired by the channel attention block (CA), and in order to optimize our network in the time dimension, we propose a video correlation filter in spatiotemporal information learning (STIL) to learn the importance of the T feature map blocks corresponding to the T input frames in each group
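The last highlight says the video correlation filter learns an importance weight for each of the T per-frame feature blocks in a group, analogously to channel attention over the time dimension. The paper's formulation is not shown here, so the following is only a hedged NumPy sketch of one plausible reading: squeeze each frame's block to a descriptor, score it with a learned vector, and softmax the scores into per-frame weights. The function name, the scoring vector `w`, and the softmax choice are assumptions for illustration.

```python
import numpy as np

def frame_importance(blocks, w):
    """Weight T per-frame feature blocks by learned importance scores.
    blocks: (T, C, H, W) feature maps for the T frames of one group."""
    desc = blocks.mean(axis=(2, 3))          # (T, C) per-frame descriptors
    scores = desc @ w                        # (T,) relevance score per frame
    scores -= scores.max()                   # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over frames
    return blocks * weights[:, None, None, None], weights

rng = np.random.default_rng(1)
blocks = rng.standard_normal((5, 8, 4, 4))  # T=5 frames, C=8 channels
w = rng.standard_normal(8)                  # hypothetical learned scorer
out, weights = frame_importance(blocks, w)
print(round(weights.sum(), 6))  # 1.0
```

Under this reading, frames whose features correlate more strongly with the learned direction `w` contribute more to the fused group representation, which matches the stated goal of optimizing the network in the time dimension.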



Introduction

Video salient object detection (VSOD) aims at finding the most obvious object in each video group. It can be applied as a basic component in many visual tasks, such as video object segmentation [1], [2], video compression [3], object tracking [4], and so on. VSOD can be roughly divided into two categories: human eye fixation prediction and mask prediction (salient object detection). Jiang et al. [9] design a two-layer convolutional long short-term memory (2C-LSTM) network to learn spatiotemporal features for predicting inter-frame saliency.


