Abstract

Most previous video object segmentation methods require a large amount of pixel-level annotated video data to construct a robust model, yet labeling segmentation masks in videos is quite expensive. In this article, we propose a self-supervised triplenet for video object segmentation that leverages only nearly unlimited unlabeled video data in the training phase. Our method consists of two modules, i.e., the temporal motion module and the appearance matching module. The temporal motion module is trained in a self-supervised manner on the pixel correspondence between two video frames; it models the motion patterns between the frames and propagates labels from one frame to the other. Meanwhile, the appearance matching module encodes the reference frame and its corresponding mask to generate the segmentation mask of the same object in the target frame. The appearance matching module can adjust and refine the output of the temporal motion module and avoid error accumulation by matching against the reference appearance. To train the appearance matching module in a self-supervised manner, we propose two mask generation strategies: foreground region mask generation and random color region mask generation. Extensive experiments conducted on four challenging video object segmentation datasets, i.e., DAVIS-2017, YouTube-VOS, DAVIS-2016, and SegTrack v2, demonstrate that the proposed method performs favorably against state-of-the-art self-supervised methods and even competitively with fully supervised methods. We also show that our self-supervised approach actually generalizes better than the majority of supervised methods.
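A minimal sketch of the label propagation step described above, assuming (as is common in self-supervised correspondence learning, though not specified in the abstract) that the temporal motion module expresses pixel correspondence as a softmax affinity between frame embeddings; the function name, tensor shapes, and temperature value are illustrative assumptions rather than the authors' implementation:

import torch
import torch.nn.functional as F

def propagate_labels(feat_ref, feat_tgt, mask_ref, temperature=0.07):
    """Propagate a reference mask to the target frame via feature affinity.

    feat_ref, feat_tgt: (C, H, W) frame embeddings from a shared encoder.
    mask_ref:           (K, H, W) one-hot object mask of the reference frame.
    Returns:            (K, H, W) soft mask prediction for the target frame.
    """
    C, H, W = feat_ref.shape
    ref = F.normalize(feat_ref.reshape(C, -1), dim=0)   # (C, HW_ref)
    tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)   # (C, HW_tgt)

    # Affinity of every target pixel to every reference pixel.
    affinity = F.softmax(tgt.t() @ ref / temperature, dim=1)  # (HW_tgt, HW_ref)

    # Each target pixel copies labels from its soft-matched reference pixels.
    labels = mask_ref.reshape(mask_ref.shape[0], -1)     # (K, HW_ref)
    mask_tgt = labels @ affinity.t()                     # (K, HW_tgt)
    return mask_tgt.reshape(-1, H, W)

In a formulation like this, errors accumulate when masks are propagated frame by frame over long sequences, which is the failure mode the appearance matching module is intended to correct by re-checking predictions against the annotated reference frame.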
