Enabling Weakly Supervised Temporal Action Localization From On-Device Learning of the Video Stream

Yue Tang, Yawen Wu, Jingtong Hu, Peipei Zhou

doi:10.1109/tcad.2022.3197536

Abstract

Detecting actions in videos has been widely applied in on-device applications, such as cars and robots. Practical on-device videos are almost always untrimmed and contain both action and background. It is desirable for a model to both recognize the class of an action and localize the temporal position where it happens. Such a task is called temporal action localization (TAL). TAL models are typically trained in the cloud, where multiple untrimmed videos are collected and labeled. It is desirable for a TAL model to learn continuously and locally from new data, which can directly improve action detection precision while protecting customers’ privacy. However, directly training a TAL model on the device is nontrivial. Training a TAL model that can precisely recognize and localize each action requires a large number of video samples with temporal annotations, and annotating videos frame by frame is exorbitantly time-consuming and expensive. Although weakly supervised temporal action localization (W-TAL) has been proposed to learn from untrimmed videos with only video-level labels, this approach is also not suitable for on-device learning scenarios. In practical on-device learning applications, data are collected as a stream. For example, the camera on the device keeps collecting video frames for hours or days, and actions of nearly all classes are included in a single long video stream. Dividing such a long video stream into multiple video segments requires substantial human effort, which hinders the application of TAL to realistic on-device learning. To enable W-TAL models to learn from a long, untrimmed streaming video, we propose an efficient video learning approach that can directly adapt to new environments. We first propose a self-adaptive video dividing approach with a contrast score-based segment merging approach to convert the video stream into multiple segments. Then, we explore different sampling strategies on the TAL tasks to request as few labels as possible. To the best of our knowledge, this is the first attempt to learn directly from an on-device, long video stream. Experimental results on the THUMOS’14 dataset show that the performance of our approach is comparable to the current W-TAL state-of-the-art (SOTA) work without any laborious manual video splitting.
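As a rough illustration of the divide-then-merge idea mentioned in the abstract, the sketch below splits a stream of per-frame feature vectors into fixed-length windows and then greedily merges adjacent windows that look alike. It is not the authors' method: the abstract does not define the contrast score, so cosine similarity between mean segment features is used here as a stand-in, and the function names, `window`, and `threshold` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # half-open frame range [start, end)


def mean_vector(features: Sequence[Sequence[float]], start: int, end: int) -> List[float]:
    """Average the feature vectors of frames start..end-1."""
    dim = len(features[0])
    count = end - start
    return [sum(features[i][d] for i in range(start, end)) / count for d in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def divide_and_merge(
    features: Sequence[Sequence[float]], window: int, threshold: float
) -> List[Segment]:
    """Split a stream into fixed windows, then merge similar neighbours.

    Assumption: similarity of mean features stands in for the paper's
    contrast score, which the abstract does not specify.
    """
    n = len(features)
    # Step 1: initial fixed-length division of the stream.
    initial = [(s, min(s + window, n)) for s in range(0, n, window)]
    # Step 2: greedy left-to-right merge of adjacent, similar segments.
    merged: List[Segment] = []
    for seg in initial:
        if merged:
            prev = merged[-1]
            sim = cosine(mean_vector(features, *prev), mean_vector(features, *seg))
            if sim >= threshold:
                merged[-1] = (prev[0], seg[1])
                continue
        merged.append(seg)
    return merged


# Toy stream: four frames of one "activity" followed by four of another.
stream = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
print(divide_and_merge(stream, window=2, threshold=0.9))  # [(0, 4), (4, 8)]
```

In this toy stream the four initial windows collapse into two segments, one per activity. Each resulting segment could then be treated as an untrimmed clip for which a single video-level label is requested.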

Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Publication Date: Nov 1, 2022
Citations: 1

Similar Papers

Weakly Supervised Temporal Action Localization Through Contrast Based Evaluation Networks
Ziyi Liu ... Qilin Zhang
01 Oct 2019

Weakly Supervised Temporal Action Localization Through Contrast Based Evaluation Networks
Ziyi Liu ... Le Wang
IEEE Transactions on Pattern Analysis and Machine Intelligence | VOL. 44
01 Jan 2020

CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning
Can Zhang ... Yuexian Zou
01 Jun 2021

Slow Motion Matters: A Slow Motion Enhanced Network for Weakly Supervised Temporal Action Localization
Weiqi Sun ... Dong Xu
IEEE Transactions on Circuits and Systems for Video Technology | VOL. 33
01 Jan 2023
