Abstract

In real-world scenarios, a video often contains multiple actors performing different activities. Selectively localizing one specific actor and its action, both spatially and temporally, from a language query is therefore a vital yet challenging task. Existing fully supervised methods require extensive, elaborately annotated data and are sensitive to the class labels, which limits their use in real-world applications. Thus, in this work we introduce the task of weakly supervised actor-action video segmentation from a sentence query (AAVSS), where only video-sentence pairs are provided. To the best of our knowledge, our work is the first to perform AAVSS under weak supervision. The task is extremely challenging, not only because it requires learning the complex interactions between two heterogeneous modalities but also because it demands fine-grained analysis of video content without pixel-level annotations. To overcome these challenges, we propose a two-stage network that first follows the sentence guidance to localize a candidate region and then segments that region, achieving selective segmentation. Specifically, a novel tracker-based clip-level multiple instance learning (MIL) paradigm is proposed in this article to learn the matches between regions and sentences, which makes our two-stage network robust to the quality of the region proposal network. Furthermore, two intrinsic characteristics of video, temporal consistency and motion information, are exploited in conjunction with the weak supervision to facilitate region-query matching. Extensive experiments show that the proposed method achieves performance comparable to state-of-the-art fully supervised approaches on two large-scale benchmarks, A2D Sentences and J-HMDB Sentences.
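To make the clip-level MIL idea concrete, the sketch below shows one plausible instantiation: each clip contributes a bag of tracked region proposals ("tracklets"), a clip matches a sentence through its best-matching tracklet, and matched video-sentence pairs are ranked above in-batch negatives with a hinge loss. All names, tensor shapes, and the specific ranking formulation here are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def clip_mil_matching_loss(region_feats, sent_feats, margin=0.2):
    """Hypothetical clip-level MIL matching loss (illustrative only).

    region_feats: (B, T, N, D) -- B clips, T frames per clip, N tracked
        region proposals per frame, D-dim visual features.
    sent_feats:   (B, D)       -- sentence-query embeddings; pair i is the
        positive match for clip i, all other pairings are negatives.
    """
    B, T, N, D = region_feats.shape
    # Aggregate each tracklet over the clip: average per-frame features
    # of each tracked region (a simple temporal-consistency assumption).
    tracklet_feats = region_feats.mean(dim=1)                 # (B, N, D)
    tracklet_feats = F.normalize(tracklet_feats, dim=-1)
    sent_feats = F.normalize(sent_feats, dim=-1)

    # Cosine similarity of every tracklet in every clip to every sentence.
    sim = torch.einsum('bnd,kd->bkn', tracklet_feats, sent_feats)  # (B, B, N)

    # MIL assumption: a clip matches a sentence via its best tracklet.
    clip_sim = sim.max(dim=-1).values                         # (B, B)

    pos = clip_sim.diagonal()                                 # matched pairs
    # Hinge ranking loss against the hardest in-batch negatives,
    # in both the sentence and the video direction.
    diag = torch.eye(B, dtype=torch.bool, device=clip_sim.device)
    masked = clip_sim.masked_fill(diag, float('-inf'))
    hardest_neg_sent = masked.max(dim=1).values               # per clip
    hardest_neg_clip = masked.max(dim=0).values               # per sentence
    loss = (F.relu(margin + hardest_neg_sent - pos)
            + F.relu(margin + hardest_neg_clip - pos)).mean()
    return loss

# Example call with random features (4 clips, 8 frames, 10 proposals, dim 256):
# loss = clip_mil_matching_loss(torch.randn(4, 8, 10, 256), torch.randn(4, 256))

The max over tracklets realizes the MIL premise that a matched clip contains at least one region agreeing with the sentence, while the frame-averaging step is only a crude stand-in for the temporal-consistency and motion cues the abstract describes.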
