Abstract

Existing self-supervised methods pose one-shot video object segmentation (O-VOS) as pixel-level matching to enable segmentation mask propagation across frames. However, the two tasks are not fully equivalent, since O-VOS relies more on semantic correspondence than on exact pixel matching. To remedy this mismatch, we explore a new self-supervised framework that integrates pixel-level correspondence learning with semantic-level adaptation. Pixel-level correspondence learning is performed through photometric reconstruction of adjacent RGB frames during offline training, while semantic-level adaptation operates at test time by enforcing bi-directional agreement between the predicted segmentation masks. In addition, we propose a new network architecture with a multi-perspective feature mining mechanism that not only enhances reliable features but also suppresses noisy ones, facilitating more robust image matching. By training the network with the proposed self-supervised framework, we achieve state-of-the-art performance on widely adopted benchmarks, further narrowing the gap between self-supervised methods and their fully supervised counterparts.
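As a concrete illustration of the two objectives named above, the following is a minimal PyTorch sketch under our own assumptions; the function names, the softmax temperature, and the shapes are illustrative choices, not the paper's actual implementation. The first function reconstructs one frame's colors from its neighbor via a feature-space affinity (the offline photometric objective); the second enforces agreement between forward- and backward-propagated masks (the test-time adaptation signal).

    import torch
    import torch.nn.functional as F

    def photometric_reconstruction_loss(feat_t, feat_t1, rgb_t, rgb_t1, temperature=0.07):
        # feat_t, feat_t1: (B, C, H, W) encoder features of adjacent frames.
        # rgb_t, rgb_t1:   (B, 3, H', W') RGB frames (resized to feature resolution below).
        B, C, H, W = feat_t.shape
        q = F.normalize(feat_t.flatten(2), dim=1)    # (B, C, HW) queries from frame t
        k = F.normalize(feat_t1.flatten(2), dim=1)   # (B, C, HW) keys from frame t+1
        # Pixel-level affinity: how strongly each location in frame t matches frame t+1.
        affinity = torch.softmax(torch.bmm(q.transpose(1, 2), k) / temperature, dim=-1)
        # Copy colors from frame t+1 to frame t according to the affinity weights.
        v = F.interpolate(rgb_t1, size=(H, W)).flatten(2)   # (B, 3, HW) colors to copy
        recon = torch.bmm(v, affinity.transpose(1, 2))      # reconstructed frame-t colors
        target = F.interpolate(rgb_t, size=(H, W)).flatten(2)
        return F.l1_loss(recon, target)

    def bidirectional_agreement_loss(mask_fwd, mask_bwd):
        # Test-time adaptation signal: a mask propagated forward through the video
        # and the same mask propagated backward should agree on each frame.
        return F.l1_loss(mask_fwd, mask_bwd)

Minimizing the photometric loss forces the affinity matrix to encode reliable pixel correspondences, which can then be reused at inference to propagate segmentation labels instead of colors.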
