Abstract

We propose a novel self-supervised Video Object Segmentation (VOS) approach that strives to achieve better visual correspondence across frames and object-background discriminability for accurate object segmentation. Distinct from previous self-supervised VOS methods, our approach is based on a learning loss formulation that takes into account both object and background information to ensure object-background discriminability, rather than using only object appearance. The objective function comprises cutout-based reconstruction and tag prediction loss terms, where a cutout region is a part of a frame whose pixels are replaced with a constant value. The cutout-based reconstruction term utilizes a simple cutout scheme to learn the pixel-wise correspondence between the current and previous frames in order to reconstruct the original current frame from its cutout-augmented version. The introduced cutout patch guides the model to focus on the reappearance of scene parts, thereby implicitly equipping the model to address occlusion-based scenarios. Next, the tag prediction term encourages object-background separability by grouping the tags of all pixels in the cutout region that are similar, while separating them from the tags of the remaining reconstructed pixels. Additionally, we introduce a zoom-in scheme that addresses the problem of small object segmentation by capturing fine structural information at multiple scales. Our proposed approach, termed CT-VOS, achieves state-of-the-art results on two challenging benchmarks: DAVIS-2017 and Youtube-VOS. A detailed ablation demonstrates the importance of the proposed loss formulation in effectively establishing correspondences and object-background discriminability, as well as the impact of our zoom-in scheme in accurately segmenting small-sized objects.
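
To make the loss formulation concrete, the sketch below illustrates one plausible way the cutout-based reconstruction term and the tag prediction term could be combined. This is a minimal, hypothetical illustration rather than the authors' implementation; all names (`apply_cutout`, `cutout_reconstruction_loss`, `tag_prediction_loss`, `lambda_tag`, `margin`) and the specific hinge-style grouping/separation formulation are assumptions made for exposition only, as the abstract does not specify the exact losses.

```python
# Hypothetical sketch of a cutout-reconstruction + tag-prediction objective.
# NOT the CT-VOS reference code; names and formulations are illustrative.
import torch
import torch.nn.functional as F


def apply_cutout(frame: torch.Tensor, box, fill_value: float = 0.0) -> torch.Tensor:
    """Replace the pixels inside `box` = (x0, y0, x1, y1) with a constant value."""
    x0, y0, x1, y1 = box
    corrupted = frame.clone()
    corrupted[..., y0:y1, x0:x1] = fill_value
    return corrupted


def cutout_reconstruction_loss(reconstructed: torch.Tensor,
                               original: torch.Tensor) -> torch.Tensor:
    """Pixel-wise error between the current frame reconstructed from the previous
    frame (via learned correspondences) and the original, uncorrupted frame."""
    return F.l1_loss(reconstructed, original)


def tag_prediction_loss(tags: torch.Tensor, cutout_mask: torch.Tensor,
                        margin: float = 1.0) -> torch.Tensor:
    """Group tags of pixels inside the cutout region while pushing them away
    from the tags of the remaining pixels (hinge-style separation).

    tags:        (B, D, H, W) per-pixel tag embeddings
    cutout_mask: (B, 1, H, W) binary mask, 1 inside the cutout region
    """
    b, d, h, w = tags.shape
    tags_flat = tags.view(b, d, -1)          # (B, D, H*W)
    mask_flat = cutout_mask.view(b, 1, -1)   # (B, 1, H*W)

    inside_mean = (tags_flat * mask_flat).sum(-1) / mask_flat.sum(-1).clamp(min=1)
    outside_mean = (tags_flat * (1 - mask_flat)).sum(-1) / (1 - mask_flat).sum(-1).clamp(min=1)

    # Grouping term: tags inside the cutout stay close to their mean tag.
    pull = (((tags_flat - inside_mean.unsqueeze(-1)) ** 2).sum(1)
            * mask_flat.squeeze(1)).sum(-1) / mask_flat.squeeze(1).sum(-1).clamp(min=1)
    # Separation term: inside and outside mean tags stay at least `margin` apart.
    push = F.relu(margin - (inside_mean - outside_mean).norm(dim=1)) ** 2
    return (pull + push).mean()


def total_loss(reconstructed, original, tags, cutout_mask, lambda_tag: float = 0.1):
    """Weighted sum of the two terms; lambda_tag is an assumed hyperparameter."""
    return (cutout_reconstruction_loss(reconstructed, original)
            + lambda_tag * tag_prediction_loss(tags, cutout_mask))
```

In this reading, the reconstruction term drives the model to recover the corrupted region from temporal correspondences, while the tag term enforces that pixels inside the cutout share similar tags distinct from those outside, giving the object-background discriminability described above.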
