Abstract

Weakly supervised Temporal Action Localization (WTAL) aims to locate the start and end boundaries of action instances and to recognize their categories. Classical methods mainly rely on random erasure, attention mechanisms, or additional loss constraints. Despite their progress, two challenges remain: incomplete localization and context confusion. We therefore propose a framework with complementary adversarial mechanisms to address these issues. In the adversarial learning stage, an input snippet is first matched against specified multi-scale anchors using a CAS score loss to roughly estimate its duration; it then undergoes frame-level iterative regression to refine its boundaries, which rejects loosely related frames and ensures that different action proposals do not overlap. A GCN module then explicitly enhances the feature representation of the refined snippet, strengthening the discrimination between different action snippets. Next, a complementary learning module computes the similarity between the original input video Vg and the video Vr reconstructed from the refined snippets, ensuring that no closely relevant frames are missed; this check acts as feedback that guides the regression module toward more accurate localization. Finally, each refined snippet is classified via multi-instance learning, and a top-k strategy aggregates temporally adjacent snippets according to their content similarity, which avoids fragmenting an action proposal. The method is evaluated on the THUMOS14 and ActivityNet1.2 datasets, achieving average accuracies of 64.68% and 32.94%, respectively; comparisons with prior work demonstrate its effectiveness.
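The final multi-instance learning step, aggregating snippet-level CAS scores into video-level class scores via a top-k strategy, can be sketched as follows. This is a minimal illustrative sketch of the standard top-k mean pooling used in WTAL, not the paper's exact implementation; the function name and the `k_ratio` value are hypothetical.

```python
import numpy as np

def video_level_scores(cas, k_ratio=0.5):
    """Aggregate a Class Activation Sequence (CAS) into video-level scores.

    cas: array of shape (T, C) with per-snippet scores for C classes.
    k_ratio: fraction of snippets averaged per class (hypothetical value).
    Returns an array of shape (C,): the top-k mean score per class.
    """
    T, C = cas.shape
    k = max(1, int(T * k_ratio))
    # For each class, keep the k highest-scoring snippets and average them,
    # so only the most confident snippets vote for the video-level label.
    topk = np.sort(cas, axis=0)[-k:, :]   # shape (k, C)
    return topk.mean(axis=0)              # shape (C,)
```

In training, these video-level scores would be compared against the video's weak (video-level) labels, which is what makes the supervision "weak": no per-frame annotation is required.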
