Abstract

Weakly supervised temporal action localization (WSTAL) is crucial for real-world applications because it relieves the heavy burden of frame-level annotation required by fully supervised action detection. Most existing WSTAL methods focus on classifying video snippets or detecting action boundaries. However, the predictions from these well-designed models have not been fully utilized. Accordingly, we propose a weakly supervised framework called the progressive enhancement network (PEN), which takes full advantage of the predictions generated by preceding models to enhance subsequent models. Specifically, snippet-level pseudo labels are generated from the preceding predictions by considering the similarity and temporal distance between action snippets. Subsequent models are then progressively enhanced in two ways: the pseudo labels serve as supervision, and their underlying semantics are used to make the feature representation better suited to the temporal localization task. Extensive experiments on two popular benchmarks, THUMOS’14 and ActivityNet v1.2, demonstrate the effectiveness of our method.
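
To make the pseudo-label step concrete, the following is a minimal illustrative sketch, not the formulation used by PEN, which the abstract does not specify. It assumes that high-confidence snippets from a preceding model act as seeds, that similarity is cosine similarity between snippet features, and that temporal distance is penalized with a Gaussian decay. The function name `pseudo_labels` and the parameters `seed_thresh`, `sigma`, and `label_thresh` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pseudo_labels(features, scores, seed_thresh=0.8, sigma=2.0, label_thresh=0.5):
    """Hypothetical snippet-level pseudo-label generation.

    features: one feature vector per snippet, in temporal order.
    scores:   action probability per snippet from a preceding model.
    All thresholds and the Gaussian decay are illustrative assumptions.
    """
    # Snippets the preceding model is confident about serve as seeds.
    seeds = [i for i, s in enumerate(scores) if s >= seed_thresh]
    labels = []
    for t, feat in enumerate(features):
        if not seeds:
            labels.append(0)
            continue
        # Affinity to a seed = feature similarity, damped by temporal distance.
        affinity = max(
            max(cosine(feat, features[i]), 0.0)
            * math.exp(-((t - i) ** 2) / (2 * sigma ** 2))
            for i in seeds
        )
        labels.append(1 if affinity >= label_thresh else 0)
    return labels


if __name__ == "__main__":
    feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    prev_scores = [0.9, 0.6, 0.1, 0.2, 0.3]
    print(pseudo_labels(feats, prev_scores))
```

In this toy input, the last snippet has the same features as the seed but lies four steps away, so the temporal decay keeps it from being labeled as action; the second snippet is both similar and adjacent, so it is labeled even though its original score was below the seed threshold.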
