Abstract

Temporal action detection usually relies on costly annotation to achieve strong performance. Semi-supervised learning, in which only a small portion of the training set is annotated, can reduce the labeling burden. However, existing action detection models inevitably learn an inductive bias from the limited labeled data, which hinders effective use of the unlabeled data in semi-supervised learning. To this end, we propose a generic end-to-end framework for Semi-Supervised Temporal Action Detection (SS-TAD). Specifically, the framework is built on a teacher-student structure that leverages the consistency between unlabeled data and their augmentations. To achieve this, we propose a dynamic consistency loss that employs an attention mechanism to alleviate the model's prediction bias, so that it can make full use of the unlabeled data. In addition, we design a concise yet effective spatiotemporal feature perturbation module to learn robust action representations. Experiments on THUMOS14 and ActivityNet v1.2 demonstrate that our method significantly outperforms state-of-the-art semi-supervised methods and is even comparable to fully supervised methods.

Keywords: Temporal action detection, Semi-supervised learning, Teacher-student model, Dynamic consistency
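
The teacher-student consistency idea summarized in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the exponential-moving-average teacher update, the squared-error consistency term, and the softmax weighting over hypothetical per-snippet scores (a simple stand-in for the attention-based dynamic weighting, whose actual form the abstract does not specify). The names `ema_update`, `softmax`, and `weighted_consistency` are likewise hypothetical.

```python
import math


def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight toward the matching student weight.

    Assumption: an exponential moving average, as in mean-teacher style
    methods. The abstract does not state how the teacher is updated.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_consistency(teacher_pred, student_pred, scores):
    """Weighted squared difference between teacher and student predictions.

    teacher_pred: predictions on unlabeled snippets.
    student_pred: predictions on augmented versions of the same snippets.
    scores: hypothetical per-snippet relevance scores; their softmax gives
            the weights (an illustrative stand-in for attention).
    """
    weights = softmax(scores)
    return sum(
        w * (t - s) ** 2
        for w, t, s in zip(weights, teacher_pred, student_pred)
    )


if __name__ == "__main__":
    teacher_pred = [0.9, 0.2, 0.6]
    student_pred = [0.7, 0.3, 0.6]
    scores = [2.0, 0.5, 1.0]
    print(weighted_consistency(teacher_pred, student_pred, scores))
    print(ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9))
```

In this sketch, snippets with higher scores contribute more to the consistency term, so disagreement between teacher and student is penalized unevenly rather than uniformly across the unlabeled data.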

