Weakly supervised video object segmentation (WSVOS) is a vital yet challenging task that aims to predict pixel-level object masks using only category labels as supervision. Existing methods still have limitations, e.g., difficulty in capturing appropriate spatiotemporal knowledge and an inability to exploit common semantic information from category labels. To overcome these challenges, we propose a novel WSVOS framework that integrates multisource saliency and incorporates an exemplar mechanism. Specifically, we propose a multisource saliency module that captures spatiotemporal knowledge by integrating spatial and temporal saliency as bottom-up cues, which effectively suppresses disruptions from confusing regions and identifies attractive regions. Moreover, to our knowledge, we make the first attempt to incorporate an exemplar mechanism into WSVOS by proposing an adaptive exemplar module that processes top-down cues, which provides reliable guidance for co-occurring objects in intraclass videos and identifies attentive regions. Our framework, which comprises these two modules, offers a new perspective on directly constructing the correspondence between bottom-up and top-down cues when no ground-truth information is available for the reference frames. Comprehensive experiments demonstrate that the proposed framework achieves state-of-the-art performance.