Abstract

Video salient object detection aims to extract the most conspicuous objects in a video sequence, which facilitates various video processing tasks, e.g., video compression and video recognition. Although remarkable progress has been made in video salient object detection, most existing methods still suffer from coarse edge boundaries, which may hinder their use in real-world applications. To alleviate this problem, in this paper we propose a Motion Context guided Edge-preserving Network (MCE-Net) for video salient object detection. MCE-Net generates temporally consistent salient edges, which are then leveraged to refine the salient object regions completely and uniformly. The core innovation of MCE-Net is an Asymmetric Cross-Reference Module (ACRM), which is designed to exploit the cross-modal complementarity between spatial structure and motion flow, facilitating robust salient object edge extraction. To leverage the extracted edge features for salient object refinement, we fuse them with multi-level spatial–temporal embeddings in a parallel guidance manner to generate the final saliency results. The proposed method is end-to-end trainable, and the edge annotations are generated automatically from the ground-truth saliency maps. Experimental evaluations on five widely used benchmarks demonstrate that our method achieves superior performance to other state-of-the-art methods. Moreover, the experimental results indicate that our method preserves salient objects with clear boundary structures in video sequences.
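
The abstract does not detail the internals of ACRM, so the following is only a minimal PyTorch sketch of one plausible asymmetric cross-reference design: motion features gate the appearance stream through channel attention, while appearance features gate the motion stream through spatial attention (a different mechanism per direction, hence "asymmetric"). The class name `AsymmetricCrossReference` and all layer choices are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of an asymmetric cross-reference between appearance
# and motion features; the abstract only states that spatial structure and
# motion flow cross-reference each other, so the design below is assumed.
import torch
import torch.nn as nn

class AsymmetricCrossReference(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Motion features gate the appearance stream via channel attention.
        self.motion_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Appearance features refine the motion stream via spatial attention.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_app: torch.Tensor, f_mot: torch.Tensor) -> torch.Tensor:
        # f_app, f_mot: [B, C, H, W] features from the RGB frame and the
        # optical flow, respectively.
        f_app_ref = f_app * self.motion_gate(f_mot)   # motion -> appearance
        f_mot_ref = f_mot * self.spatial_gate(f_app)  # appearance -> motion
        return self.fuse(torch.cat([f_app_ref, f_mot_ref], dim=1))
```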
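The abstract also notes that edge annotations are derived automatically from the ground-truth saliency maps. A common way to do this is a morphological gradient on the binarized mask; the helper below, `saliency_to_edge_label`, and its `thickness` parameter are hypothetical illustrations of that idea rather than the paper's exact procedure.

```python
# Sketch: derive a binary edge label from a ground-truth saliency mask
# via a morphological gradient (dilation minus erosion). The operator
# choice and boundary thickness are assumptions.
import cv2
import numpy as np

def saliency_to_edge_label(saliency_map: np.ndarray, thickness: int = 3) -> np.ndarray:
    """saliency_map: uint8 array in [0, 255]; returns a 0/1 edge map."""
    mask = (saliency_map > 127).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (thickness, thickness))
    # The morphological gradient keeps only pixels on the object boundary.
    edge = cv2.morphologyEx(mask, cv2.MORPH_GRADIENT, kernel)
    return edge  # 1 on boundary pixels, 0 elsewhere
```

Generating edge supervision this way requires no extra manual annotation, which is consistent with the abstract's claim that the method remains end-to-end trainable.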
