Abstract
Weakly supervised temporal action localization (WSTAL) is crucial for real-world applications, as it relieves the huge burden of frame-level annotation required by fully supervised action detection. Most existing WSTAL methods focus on classifying video snippets or detecting action boundaries. However, the predictions from these well-designed models have not been fully utilized. Accordingly, we propose a weakly supervised framework called the progressive enhancement network (PEN), which takes full advantage of the predictions generated by preceding models to enhance subsequent models. Specifically, snippet-level pseudo labels are generated from the preceding predictions by considering the similarity and temporal distance between action snippets. Subsequent models are then progressively enhanced by using the pseudo labels as supervision and by exploiting their underlying semantics to make the feature representation better suited to the temporal localization task. Extensive experiments on two popular benchmarks, THUMOS’14 and ActivityNet v1.2, demonstrate the effectiveness of our method.
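The abstract mentions generating snippet-level pseudo labels from a preceding model's predictions by considering both feature similarity and temporal distance between snippets. The sketch below is only a plausible illustration of that idea, not the paper's actual formulation: snippet scores are smoothed with an affinity matrix that combines cosine similarity and a Gaussian temporal kernel, then thresholded into hard pseudo labels. All names and hyperparameters (sigma_t, threshold, feature dimensions) are assumptions made for the example.

```python
# Illustrative sketch (not the paper's exact method): propagate snippet-level
# pseudo labels from a preceding model's class scores, weighting each
# neighbour's vote by feature similarity and temporal proximity.
import numpy as np

def generate_pseudo_labels(features, scores, sigma_t=5.0, threshold=0.5):
    """features: (T, D) snippet features; scores: (T, C) class probabilities
    from the preceding model. Returns (T, C) binary pseudo labels."""
    T = features.shape[0]

    # Cosine similarity between every pair of snippets.
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T                      # (T, T)

    # Temporal affinity: snippets that are closer in time influence each other more.
    idx = np.arange(T)
    dist = np.abs(idx[:, None] - idx[None, :])   # |i - j|
    temporal = np.exp(-(dist ** 2) / (2 * sigma_t ** 2))

    # Combined affinity, row-normalised so each snippet aggregates a weighted
    # average of its neighbours' predictions.
    affinity = np.clip(sim, 0, None) * temporal
    affinity /= affinity.sum(axis=1, keepdims=True)

    refined = affinity @ scores                  # (T, C) smoothed scores
    return (refined >= threshold).astype(np.float32)

# Toy usage: 100 snippets, 64-d features, 20 action classes.
feats = np.random.randn(100, 64)
preds = np.random.rand(100, 20)
pseudo = generate_pseudo_labels(feats, preds)
print(pseudo.shape)  # (100, 20)
```

In such a scheme the resulting pseudo labels would supervise the next model in the cascade, which is the progressive-enhancement idea described in the abstract.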