Abstract

Weakly-supervised video object segmentation is an emerging video task that tracks and segments a target given only a simple bounding-box label, which requires the method to fully capture and exploit the target information. Most existing approaches rely on the guidance of a single frame and ignore the interaction between frames when gathering information, making it hard for them to achieve a reliable target representation. In this paper, we propose to capture temporal dependencies and gather information from multiple frames through bilateral temporal re-aggregation. We explore three schemes to build this aggregation: 1) a two-stage re-aggregation mechanism supplies a target prior to the current frame, yielding more valid feature matching and information aggregation; 2) a query-memory bilateral aggregation module aggregates features from an unlimited number of past frames and enables mutual perception between frames to validate the gathered information; 3) we guide the learning of the aggregation modules through a novel cross-task representation distillation, transferring knowledge from a semi-supervised model to our weakly-supervised model without increasing inference latency. Together, these schemes build an efficient and competent aggregation process, allowing us to fully exploit the video context at inference time. Experimental results on four benchmarks show that our method outperforms previous methods while maintaining efficiency (e.g., overall scores of 70.4% and 72.5% on the YouTube-VOS and DAVIS 2017 validation sets, respectively).
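The abstract does not detail the query-memory aggregation module, but memory-based VOS methods commonly read from stored past-frame features with an attention-style affinity between the current (query) frame and the memory. Below is a minimal, generic sketch of such a memory read; the function name, shapes, and single-head formulation are illustrative assumptions, not the paper's actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(query_feat, memory_keys, memory_values):
    """Aggregate past-frame information into the current (query) frame.

    query_feat:    (N_q, C) flattened current-frame features
    memory_keys:   (N_m, C) key features pooled from all stored past frames
    memory_values: (N_m, C) corresponding value features
    Returns (N_q, C): memory features aggregated per query location.
    """
    scale = 1.0 / np.sqrt(query_feat.shape[-1])
    # Affinity between every query location and every memory location.
    affinity = softmax(query_feat @ memory_keys.T * scale, axis=-1)  # (N_q, N_m)
    return affinity @ memory_values

# Toy usage: a memory of 8 locations (e.g. two past frames of 4 locations each),
# 8-dimensional features.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
mk = rng.standard_normal((8, 8))
mv = rng.standard_normal((8, 8))
out = memory_read(q, mk, mv)
print(out.shape)  # (4, 8)
```

Because the memory is a flat set of key/value locations, this formulation scales to an arbitrary number of past frames, which matches the abstract's claim of aggregating from an unlimited amount of history.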
