Abstract

This paper proposes a novel self-supervised method for one-shot video object segmentation (VOS), in which object annotations are provided only for the first frame. Current self-supervised VOS approaches model the pairwise correspondence between the target and reference frames, which maintains only spatio-temporal consistency. However, VOS requires not only the spatio-temporal relationship between two frames but also salient object information within each frame. To this end, we propose a disentangled representation strategy that decomposes the temporal correspondence into a pairwise term and a unary term, which capture inter-frame spatio-temporal information and intra-frame salient object information, respectively. To demonstrate the importance of the disentangled representation, we evaluate the proposed approach on the DAVIS-2017 and YouTube-VOS datasets. Experimental results confirm the effectiveness of the proposed solution.
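As an illustration of the pairwise correspondence that existing self-supervised approaches rely on, the following is a minimal sketch of affinity-based label propagation from a reference frame to a target frame. It is not the method proposed in this paper: the function name `propagate_labels`, the cosine-similarity affinity, and the `temperature` value are all assumptions chosen for illustration, and the unary (salient object) term is not sketched because the abstract does not specify its form.

```python
import numpy as np


def propagate_labels(ref_feats, tgt_feats, ref_labels, temperature=0.07):
    """Propagate reference-frame labels to a target frame (hypothetical helper).

    ref_feats:  (N_ref, C) per-pixel features of the reference frame
    tgt_feats:  (N_tgt, C) per-pixel features of the target frame
    ref_labels: (N_ref, K) one-hot object labels of the reference frame
    Returns an (N_tgt, K) array of soft labels for the target frame.
    """
    # L2-normalise features so the dot product is a cosine similarity.
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)

    # Pairwise similarity between every target and reference pixel.
    logits = (tgt @ ref.T) / temperature

    # Row-wise softmax (shifted by the row maximum for numerical stability).
    logits = logits - logits.max(axis=1, keepdims=True)
    affinity = np.exp(logits)
    affinity = affinity / affinity.sum(axis=1, keepdims=True)

    # Each target pixel receives an affinity-weighted mix of reference labels.
    return affinity @ ref_labels


# Toy example: two reference pixels, one per object, and two target pixels.
ref_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
ref_labels = np.eye(2)
tgt_feats = np.array([[0.9, 0.1], [0.2, 0.8]])
soft_labels = propagate_labels(ref_feats, tgt_feats, ref_labels)
predicted = soft_labels.argmax(axis=1)  # object index per target pixel
```

Because such a propagation step depends only on inter-frame similarity, it carries no explicit notion of which pixels in a single frame belong to a salient object, which is the gap the unary term is intended to address.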
