Abstract

Semi-supervised video object segmentation aims to segment a target object throughout a video given only its annotated mask in the first frame. Recently, memory-based methods have attracted increasing attention owing to their significant performance improvements. However, these methods perform pixel-level matching based solely on feature similarity, without considering the object's trajectory or object-level features, which may cause mismatches between object and non-object regions in complex scenarios. To alleviate this problem, we propose spatial and temporal guidance for semi-supervised video object segmentation. The proposed method accounts for the spatiotemporal consistency of the object and employs global matching for pixel-level matching. Moreover, we design a spatial guidance module (SGM) to track the object's trajectory and a temporal guidance module (TGM) to attend to long-term object-level features from the first frame. The proposed spatial and temporal guidance effectively alleviates mismatching and makes the model more robust and efficient. Experiments on the YouTube-VOS and DAVIS benchmarks show that our method outperforms previous state-of-the-art methods while maintaining a fast inference speed.
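
The pixel-level "global matching" the abstract refers to can be illustrated with a minimal, memory-based readout sketch: query-frame key features attend to memory key features by similarity, and the matched memory values are aggregated for each query pixel. All function names, tensor shapes, and the scaled-softmax normalization below are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of memory-based pixel-level (global) matching.
# Shapes and names are assumptions for illustration only.
import torch
import torch.nn.functional as F

def global_matching(query_key, memory_key, memory_value):
    """
    query_key:    (B, C_k, H, W)     key features of the current frame
    memory_key:   (B, C_k, T, H, W)  key features of memorized frames
    memory_value: (B, C_v, T, H, W)  value features of memorized frames
    returns:      (B, C_v, H, W)     per-pixel readout from memory
    """
    B, Ck, H, W = query_key.shape
    T = memory_key.shape[2]
    Cv = memory_value.shape[1]

    q = query_key.view(B, Ck, H * W)        # (B, C_k, HW)
    mk = memory_key.view(B, Ck, T * H * W)  # (B, C_k, THW)
    mv = memory_value.view(B, Cv, T * H * W)  # (B, C_v, THW)

    # Affinity between every query pixel and every memory pixel,
    # normalized over the memory dimension.
    affinity = torch.einsum("bck,bcn->bkn", q, mk) / (Ck ** 0.5)  # (B, HW, THW)
    weights = F.softmax(affinity, dim=-1)

    # Similarity-weighted readout of memory values for each query pixel.
    readout = torch.einsum("bkn,bcn->bck", weights, mv)  # (B, C_v, HW)
    return readout.view(B, Cv, H, W)

if __name__ == "__main__":
    q = torch.randn(1, 64, 30, 30)
    mk = torch.randn(1, 64, 3, 30, 30)
    mv = torch.randn(1, 128, 3, 30, 30)
    print(global_matching(q, mk, mv).shape)  # torch.Size([1, 128, 30, 30])
```

Because this matching is driven purely by feature similarity, distractors with similar appearance can be matched incorrectly, which is the mismatching problem the proposed spatial and temporal guidance is designed to suppress.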

