Abstract

We propose a novel and unified solution for user-guided video object segmentation tasks. In this work, we consider two scenarios of user-guided segmentation: semi-supervised and interactive segmentation. Due to the nature of the problem, the available cues, namely video frames with object masks (or scribbles), become richer as intermediate predictions (or additional user inputs) accumulate. However, existing methods cannot fully exploit this rich source of information. We resolve the issue by leveraging memory networks and learning to read relevant information from all available sources. In the semi-supervised scenario, the previous frames with object masks form an external memory, and the current frame, serving as the query, is segmented using the information in the memory. Similarly, to handle user interactions, the frames that received user inputs form the memory that guides segmentation. Internally, the query and the memory are densely matched in the feature space, covering all space-time pixel locations in a feed-forward fashion. This abundant use of guidance information allows us to better handle challenges such as appearance changes and occlusions. We validate our method on the latest benchmark sets and achieve state-of-the-art performance along with a fast runtime.
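The dense space-time matching described above can be illustrated with a minimal NumPy sketch: query-frame key features attend over the key features of all memory frames, and the resulting softmax weights aggregate the memory value features. The function name, tensor shapes, and the scaled dot-product similarity here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def memory_read(mem_keys, mem_vals, query_keys):
    """Dense space-time memory read (illustrative sketch).

    mem_keys:   (T*H*W, C_k) key features from all memory frames
    mem_vals:   (T*H*W, C_v) value features from all memory frames
    query_keys: (H*W, C_k)   key features of the current (query) frame
    Returns:    (H*W, C_v)   memory features read for each query location
    """
    # Similarity between every query location and every space-time
    # memory location (scaled dot product, an assumed choice here).
    affinity = query_keys @ mem_keys.T / np.sqrt(mem_keys.shape[1])
    # Numerically stable softmax over all memory locations.
    weights = np.exp(affinity - affinity.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Each query location reads a weighted sum of memory values.
    return weights @ mem_vals
```

Because every query pixel attends over every memory pixel across all stored frames, guidance from any earlier frame can inform the current prediction, which is what makes the memory robust to appearance changes and occlusions.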
