Abstract

This article aims to solve the video object segmentation (VOS) task in a scribble-supervised manner, in which VOS models are not only initialized with sparse target scribbles for inference but also trained by sparse scribble annotations. Thus, the annotation burdens for both initialization and training can be substantially lightened. The difficulties of scribble-supervised VOS lie in two aspects: 1) it demands a strong reasoning ability to carefully segment the target given only a sparse initial target scribble and 2) it necessitates learning dense prediction from sparse scribble annotations during training, requiring powerful learning capability. In this work, we propose a reliability-guided hierarchical memory network (RHMNet) for this task, which segments the target in a stepwise expanding strategy w.r.t. the memory reliability level. To be specific, RHMNet maintains a reliability-guided memory bank. It first uses the high-reliability memory to locate the region with high reliability belonging to the target, i.e., highly similar to the initial target scribble. Then, it expands the located high-reliability region to the entire target conditioned on the region itself and all existing memories. In addition, we propose a scribble-supervised learning mechanism to facilitate the model learning for dense prediction. It exploits the pixel-level relations within a single frame and the instance-level variations across multiple frames to take full advantage of the scribble annotations in sequence training samples. The favorable performance on four popular benchmarks demonstrates that our method is promising. Our project is available at: https://github.com/mkg1204/RHMNet-for-SSVOS.
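The stepwise expansion described above (first locate a high-reliability region, then grow it using that region plus all memories) can be illustrated with a toy sketch. This is not the authors' implementation: RHMNet uses learned networks, whereas this sketch substitutes plain cosine similarity with fixed thresholds. The function names, the thresholds `seed_thresh` and `expand_thresh`, and the 2-D "pixel features" are all hypothetical and chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def best_match(feature, memory):
    """Highest similarity between one query feature and any memory entry."""
    return max((cosine(feature, m) for m in memory), default=0.0)

def stepwise_segment(query, high_rel_memory, other_memory,
                     seed_thresh=0.9, expand_thresh=0.7):
    """Toy two-step readout (assumed thresholds, not the paper's learned model).

    Step 1: mark pixels that closely match the high-reliability memory
            (e.g. features under the initial scribble).
    Step 2: expand the mask using the seed region itself plus all memories.
    """
    seed = [best_match(f, high_rel_memory) >= seed_thresh for f in query]
    seed_feats = [f for f, s in zip(query, seed) if s]
    all_memory = high_rel_memory + other_memory + seed_feats
    mask = [s or best_match(f, all_memory) >= expand_thresh
            for f, s in zip(query, seed)]
    return seed, mask

# Four query "pixels" with 2-D features; one scribble feature in the
# high-reliability memory and one lower-reliability memory entry.
query = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
seed, mask = stepwise_segment(query,
                              high_rel_memory=[[1.0, 0.0]],
                              other_memory=[[0.6, 0.8]])
print(seed)  # only the pixel matching the scribble is seeded
print(mask)  # the seed expands to similar pixels; the dissimilar one stays out
```

In this toy run only the first pixel passes the strict seed test, and the second step then admits the pixels that resemble either the seed or the lower-reliability memory, which mirrors the coarse-to-fine ordering the abstract describes.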
