Abstract

Existing self-supervised methods pose one-shot video object segmentation (O-VOS) as pixel-level matching to enable segmentation mask propagation across frames. However, the two tasks are not fully equivalent, since O-VOS relies more on semantic correspondence than on exact pixel matching. To remedy this mismatch, we explore a new self-supervised framework that integrates pixel-level correspondence learning with semantic-level adaptation. Pixel-level correspondence learning is performed through photometric reconstruction of adjacent RGB frames during offline training, while semantic-level adaptation operates at test time by enforcing bi-directional agreement between the predicted segmentation masks. In addition, we propose a new network architecture with a multi-perspective feature mining mechanism that not only enhances reliable features but also suppresses noisy ones, facilitating more robust image matching. By training the network with the proposed self-supervised framework, we achieve state-of-the-art performance on widely adopted benchmarks, further narrowing the gap between self-supervised methods and their fully supervised counterparts.
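As a concrete illustration of the two objectives named above, the following is a minimal PyTorch sketch under our own assumptions; the function names, the softmax temperature, and the shapes are illustrative choices, not the paper's actual implementation. The first function reconstructs one frame's colors from its neighbor via a feature-space affinity (the offline photometric objective); the second enforces agreement between forward- and backward-propagated masks (the test-time adaptation signal).

    import torch
    import torch.nn.functional as F

    def photometric_reconstruction_loss(feat_t, feat_t1, rgb_t, rgb_t1, temperature=0.07):
        # feat_t, feat_t1: (B, C, H, W) encoder features of adjacent frames.
        # rgb_t, rgb_t1:   (B, 3, H', W') RGB frames (resized to feature resolution below).
        B, C, H, W = feat_t.shape
        q = F.normalize(feat_t.flatten(2), dim=1)    # (B, C, HW) queries from frame t
        k = F.normalize(feat_t1.flatten(2), dim=1)   # (B, C, HW) keys from frame t+1
        # Pixel-level affinity: how strongly each location in frame t matches frame t+1.
        affinity = torch.softmax(torch.bmm(q.transpose(1, 2), k) / temperature, dim=-1)
        # Copy colors from frame t+1 to frame t according to the affinity weights.
        v = F.interpolate(rgb_t1, size=(H, W)).flatten(2)   # (B, 3, HW) colors to copy
        recon = torch.bmm(v, affinity.transpose(1, 2))      # reconstructed frame-t colors
        target = F.interpolate(rgb_t, size=(H, W)).flatten(2)
        return F.l1_loss(recon, target)

    def bidirectional_agreement_loss(mask_fwd, mask_bwd):
        # Test-time adaptation signal: a mask propagated forward through the video
        # and the same mask propagated backward should agree on each frame.
        return F.l1_loss(mask_fwd, mask_bwd)

Minimizing the photometric loss forces the affinity matrix to encode reliable pixel correspondences, which can then be reused at inference to propagate segmentation labels instead of colors.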
