Abstract
In this paper, we study language-level video object segmentation, where a first-frame language annotation is provided to describe the target object. By exploiting the fact that a language label is normally compatible with all frames in a video, the proposed method can choose the most suitable starting frame, mitigating the initialization failure issue. Moreover, in addition to extracting visual features from static video frames, we propose a motion-language score based on optical flow to better represent moving objects. Finally, scores from multiple criteria are aggregated using an attention-based mechanism to predict the final result. The proposed method is evaluated on four widely used video object segmentation datasets (DAVIS 2017, DAVIS 2016, SegTrack V2, and YoutubeObject), and achieves new state-of-the-art accuracy (mean region similarity) on both DAVIS 2017 (67.2%) and DAVIS 2016 (83.5%). Source code will be published together with the paper.