Abstract

In real-world scenarios, a video often contains multiple actors performing different activities. Selectively localizing one specific actor and its action, both spatially and temporally, from a language query is therefore a vital yet challenging task. Existing fully supervised methods require extensive, elaborately annotated data and are sensitive to the class labels, which limits their use in real-world applications. Thus, in this work we introduce the task of weakly supervised actor-action video segmentation from a sentence query (AAVSS), where only video-sentence pairs are provided. To the best of our knowledge, our work is the first to perform AAVSS under weak supervision. The task is extremely challenging, not only because it requires learning the complex interactions between two heterogeneous modalities but also because it demands fine-grained analysis of video content without pixel-level annotations. To overcome these challenges, we propose a two-stage network that first follows the sentence guidance to localize a candidate region and then segments that region, achieving selective segmentation. Specifically, a novel tracker-based clip-level multiple instance learning (MIL) paradigm is proposed in this article to learn the matches between regions and sentences, which makes our two-stage network robust to the quality of the region proposal network. Furthermore, two intrinsic characteristics of video, temporal consistency and motion information, are exploited in conjunction with the weak supervision to facilitate region-query matching. Extensive experiments show that the proposed method achieves performance comparable to state-of-the-art fully supervised approaches on two large-scale benchmarks, A2D Sentences and J-HMDB Sentences.
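To make the clip-level MIL idea concrete, the sketch below shows one plausible instantiation: each clip contributes a bag of tracked region proposals ("tracklets"), a clip matches a sentence through its best-matching tracklet, and matched video-sentence pairs are ranked above in-batch negatives with a hinge loss. All names, tensor shapes, and the specific ranking formulation here are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def clip_mil_matching_loss(region_feats, sent_feats, margin=0.2):
    """Hypothetical clip-level MIL matching loss (illustrative only).

    region_feats: (B, T, N, D) -- B clips, T frames per clip, N tracked
        region proposals per frame, D-dim visual features.
    sent_feats:   (B, D)       -- sentence-query embeddings; pair i is the
        positive match for clip i, all other pairings are negatives.
    """
    B, T, N, D = region_feats.shape
    # Aggregate each tracklet over the clip: average per-frame features
    # of each tracked region (a simple temporal-consistency assumption).
    tracklet_feats = region_feats.mean(dim=1)                 # (B, N, D)
    tracklet_feats = F.normalize(tracklet_feats, dim=-1)
    sent_feats = F.normalize(sent_feats, dim=-1)

    # Cosine similarity of every tracklet in every clip to every sentence.
    sim = torch.einsum('bnd,kd->bkn', tracklet_feats, sent_feats)  # (B, B, N)

    # MIL assumption: a clip matches a sentence via its best tracklet.
    clip_sim = sim.max(dim=-1).values                         # (B, B)

    pos = clip_sim.diagonal()                                 # matched pairs
    # Hinge ranking loss against the hardest in-batch negatives,
    # in both the sentence and the video direction.
    diag = torch.eye(B, dtype=torch.bool, device=clip_sim.device)
    masked = clip_sim.masked_fill(diag, float('-inf'))
    hardest_neg_sent = masked.max(dim=1).values               # per clip
    hardest_neg_clip = masked.max(dim=0).values               # per sentence
    loss = (F.relu(margin + hardest_neg_sent - pos)
            + F.relu(margin + hardest_neg_clip - pos)).mean()
    return loss

# Example call with random features (4 clips, 8 frames, 10 proposals, dim 256):
# loss = clip_mil_matching_loss(torch.randn(4, 8, 10, 256), torch.randn(4, 256))

The max over tracklets realizes the MIL premise that a matched clip contains at least one region agreeing with the sentence, while the frame-averaging step is only a crude stand-in for the temporal-consistency and motion cues the abstract describes.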
