Abstract
Semi-supervised Video Object Segmentation (VOS) needs to establish pixel-level correspondences between a video frame and preceding segmented frames to leverage their segmentation clues. Most works rely on features at a single scale to establish those correspondences, e.g., perform dense matching with Convolutional Neural Network (CNN) features from a deep layer. Differently, this work explores complementary features at different scales to pursue more robust feature matching. A coarse feature from a deep layer is first adopted to get coarse pixel-level correspondences. We hence evaluate the quality of those correspondences, and select pixels with low-quality correspondences for fine-scale feature matching. Segmentation clues of previous frames are propagated by both coarse and fine-scale correspondences, which are fused with appearance features for object segmentation. Compared with previous works, this coarse-to-fine matching scheme is more robust to distractions by similar objects and better preserves object details. The sparse fine-scale matching also ensures a fast inference speed. On popular VOS datasets including DAVIS and YouTube-VOS, the proposed method shows promising performance compared with recent works.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: ACM Transactions on Multimedia Computing, Communications, and Applications
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.