Abstract

Semi-supervised video object segmentation is the task of tracking and segmenting objects throughout a video sequence given annotated masks for one or more frames. Recently, memory-based methods have attracted significant attention due to their strong performance. However, storing too much redundant information in memory makes such methods inefficient and inaccurate. Moreover, because a global matching strategy is usually used for memory reading, these methods are susceptible to interference from semantically similar objects and are prone to incorrect segmentation. We propose a spatial constraint network to overcome these problems. In particular, we introduce a time-varying sensor and a dynamic feature memory that adaptively store pixel information to facilitate modeling of the target object, greatly reducing information redundancy in the memory without discarding critical information. Furthermore, we propose an efficient memory reader that is less computationally intensive and has a smaller memory footprint. More importantly, we introduce a spatial constraint module that learns spatial consistency to obtain more precise segmentation; the target and distractors can be distinguished by the learned spatial response. Experimental results indicate that our method is competitive with state-of-the-art methods on several benchmark datasets. Our method also achieves an inference speed of approximately 30 FPS, which is close to the requirement of real-time systems.
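The global matching strategy criticized in the abstract can be illustrated with a minimal sketch: every pixel of the current frame attends to every pixel stored in memory via key similarity, with no spatial constraint. This is an assumption-laden illustration of the generic memory-reading scheme the abstract refers to, not the authors' proposed reader; the function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_memory_read(query_key, memory_keys, memory_values):
    """Global-matching memory read (illustrative sketch).

    query_key:     (HW, C) key features of the current frame's pixels
    memory_keys:   (N, C)  keys of all pixels stored in memory
    memory_values: (N, D)  value features of the stored pixels
    returns:       (HW, D) read-out features for the current frame

    Every query pixel is matched against every memory pixel, which is
    why this strategy scales with the full memory size and can be
    distracted by semantically similar objects elsewhere in the frame.
    """
    # scaled dot-product similarity between query and memory pixels
    affinity = query_key @ memory_keys.T / np.sqrt(query_key.shape[1])
    weights = softmax(affinity, axis=1)  # attention over memory pixels
    return weights @ memory_values
```

With a single memory entry, every query pixel simply reads back that entry's value, which makes the behavior easy to check.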
