Abstract

Weakly-supervised video object segmentation (WVOS) is an emerging video task in which a target is tracked and segmented given only a simple bounding-box label. However, existing WVOS methods remain unsatisfactory in either speed or accuracy, since they use only the exemplar frame to guide the prediction and neglect the reference information available in other frames. To solve this problem, we propose a novel Re-Aggregation based framework, which uses feature matching to efficiently find the target and captures temporal dependencies from multiple frames to guide the segmentation. Based on a two-stage structure, our framework builds an information-symmetric matching process to achieve robust aggregation. In each stage, we design a Query-Memory Aggregation (QMA) module that gathers features from past frames and performs bidirectional aggregation to adaptively weight the aggregated features, which relieves the latent misguidance of unidirectional aggregation. To further exploit the information from different aggregation stages, we propose a novel coarse-fine constraint that uses the Cascaded Refinement Module (CRM) to combine the predictions from different stages and further boost performance. Experimental results on three benchmarks show that our method achieves state-of-the-art performance in WVOS (e.g., an overall score of 84.7% on the DAVIS 2016 validation set).
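
To make the idea of query-memory matching concrete, the following is a minimal NumPy sketch of generic attention-style aggregation from past-frame features, with a simple two-direction weighting. It is not the paper's QMA module: the function name `aggregate_from_memory`, the scaled dot-product affinity, and the agreement-based confidence score are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate_from_memory(query_feat, memory_feat, memory_value):
    """Hypothetical query-memory aggregation (illustrative only).

    query_feat:   (Nq, C) features of the current-frame pixels
    memory_feat:  (Nm, C) features of pixels from past frames
    memory_value: (Nm, D) information stored for each memory pixel
    Returns the aggregated features (Nq, D) and a per-pixel confidence (Nq,).
    """
    scale = np.sqrt(query_feat.shape[1])
    affinity = query_feat @ memory_feat.T / scale  # (Nq, Nm)

    # Query-to-memory direction: each query pixel attends over memory pixels.
    q2m = softmax(affinity, axis=1)
    # Memory-to-query direction: each memory pixel attends over query pixels.
    m2q = softmax(affinity, axis=0)

    # Unidirectional aggregation: convex combination of memory values.
    aggregated = q2m @ memory_value  # (Nq, D)

    # Assumed weighting: a match counts only if both directions agree.
    confidence = (q2m * m2q).sum(axis=1)  # (Nq,), each value in [0, 1]
    return aggregated, confidence


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(6, 8))    # 6 query pixels, 8 channels
    m = rng.normal(size=(10, 8))   # 10 memory pixels
    v = rng.normal(size=(10, 4))   # 4-dimensional memory values
    agg, conf = aggregate_from_memory(q, m, v)
    print(agg.shape, conf.shape)
```

In this sketch, the confidence could be used to down-weight query pixels whose best memory match is not reciprocated, which is one plausible reading of how a second matching direction can reduce misguidance from one-way matching.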
