Abstract

Given only video-level action categorical labels during training, weakly-supervised temporal action localization (WS-TAL) learns to detect action instances and locates their temporal boundaries in untrimmed videos. Compared to its fully supervised counterpart, WS-TAL is more cost-effective in data labeling and thus favorable in practical applications. However, the coarse video-level supervision inevitably incurs ambiguities in action localization, especially in untrimmed videos containing multiple action instances. To overcome this challenge, we observe that significant temporal contrasts among video snippets, e.g., caused by temporal discontinuities and sudden changes, often occur around true action boundaries. This motivates us to introduce a Contrast-based Localization EvaluAtioN Network (CleanNet), whose core is a new temporal action proposal evaluator, which provides fine-grained pseudo supervision by leveraging the temporal contrasts among snippet-level classification predictions. As a result, the uncertainty in locating action instances can be resolved via evaluating their temporal contrast scores. Moreover, the new action localization module is an integral part of CleanNet which enables end-to-end training. This is in contrast to many existing WS-TAL methods where action localization is merely a post-processing step. Besides, we also explore the usage of temporal contrast on temporal action proposal (TAP) generation task, which we believe is the first attempt with the weak supervision setting. Experiments on the THUMOS14, ActivityNet v1.2 and v1.3 datasets validate the efficacy of our method against existing state-of-the-art WS-TAL algorithms.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call