Abstract

Video object segmentation automatically separates objects of interest from the background across a video sequence and has been an active research area in recent years. The crucial challenge lies in designing an effective architecture that fully exploits the spatiotemporal correlation in a given video sequence to achieve accurate segmentation. In this paper, we propose a novel semi-supervised Transformer-based framework, Target-guided Spatiotemporal Dual-stream Transformers (TSDT), which uses two separate streams to enable effective spatiotemporal context propagation. Technically, the temporal stream aggregates rich temporal cues from past frames, while the spatial stream encodes the object location and appearance information contained in the current frame. To compress and integrate temporal features, a target guidance block (TGB) is designed to retrieve target information from the past video flow under the guidance of the current frame. Experimental results on video object segmentation benchmarks demonstrate the feasibility and effectiveness of the proposed framework. Code and trained models are available at https://github.com/zhouweii234/TSDTVOS.
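To make the target guidance block concrete, below is a minimal sketch of one plausible realization as cross-attention, in which current-frame tokens act as queries and tokens aggregated over past frames act as keys and values. This is an illustrative assumption, not the paper's exact design: the class name `TargetGuidanceBlock`, the token shapes, and the residual layout are all hypothetical.

```python
# Hypothetical sketch of a target guidance block as cross-attention;
# the paper's actual TGB architecture may differ.
import torch
import torch.nn as nn

class TargetGuidanceBlock(nn.Module):
    """Retrieve target cues from past-frame features under the guidance
    of the current frame: current-frame tokens are queries, flattened
    past-frame tokens are keys and values."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, curr: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
        # curr: (B, N, C) tokens of the current frame (spatial stream)
        # past: (B, T*N, C) tokens aggregated over T past frames (temporal stream)
        guided, _ = self.attn(query=curr, key=past, value=past)
        # Residual connection preserves current-frame appearance information.
        return self.norm(curr + guided)

# Usage: compress 4 past frames into target-relevant features for the current frame.
tgb = TargetGuidanceBlock(dim=256)
curr = torch.randn(2, 1024, 256)      # current-frame tokens
past = torch.randn(2, 4 * 1024, 256)  # tokens from 4 past frames
out = tgb(curr, past)                 # (2, 1024, 256)
```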
