Abstract

This paper proposes a novel self-supervised method for one-shot video object segmentation (VOS), where object annotations are provided only in the first frame. Current self-supervised VOS approaches model the pairwise correspondence between the target and reference frames, which maintains only spatio-temporal consistency. However, VOS tasks require not only a spatio-temporal relationship between the two frames but also salient-object information within each frame. To this end, we propose a disentangled representation strategy that decomposes the temporal correspondence into a pairwise term and a unary term, which capture inter-frame spatio-temporal information and intra-frame salient-object information, respectively. To demonstrate the importance of the disentangled representation, we evaluate the proposed approach on the DAVIS-2017 and YouTube-VOS datasets. Experimental results confirm the effectiveness of the proposed solution.
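The decomposition described above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's actual implementation: feature shapes, the temperature `tau`, the sigmoid saliency head `w_sal`, and the multiplicative way of combining the two terms are all hypothetical placeholders. The pairwise term is an inter-frame affinity (each target pixel attends over reference pixels), and the unary term is a per-pixel intra-frame saliency score.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pairwise_term(f_ref, f_tgt, tau=0.07):
    """Inter-frame term: a row-stochastic affinity matrix in which each
    target pixel attends to all reference pixels (dot-product similarity).
    f_ref: (N_ref, C) reference-frame features; f_tgt: (N_tgt, C)."""
    sim = f_tgt @ f_ref.T                      # (N_tgt, N_ref)
    return softmax(sim / tau, axis=1)

def unary_term(f_tgt, w_sal):
    """Intra-frame term: a per-pixel saliency score for the target frame.
    w_sal is a hypothetical linear saliency head followed by a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(f_tgt @ w_sal)))

def propagate_labels(labels_ref, f_ref, f_tgt, w_sal):
    """Warp first-frame labels to the target frame with the pairwise
    affinity, then modulate them by the unary saliency score."""
    A = pairwise_term(f_ref, f_tgt)            # (N_tgt, N_ref)
    warped = A @ labels_ref                    # (N_tgt, K) label scores
    s = unary_term(f_tgt, w_sal)[:, None]      # (N_tgt, 1)
    return warped * s
```

Because the affinity rows sum to one, the warped labels stay in the same range as the reference labels; the unary factor then suppresses label mass at non-salient target pixels.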
