Abstract

Weakly-supervised temporal action localization aims to predict both the categories and the temporal intervals of actions in an untrimmed video using only video-level labels. Previous methods aggregate category scores through a classification network to generate a temporal class activation map (T-CAM) and obtain the temporal regions of the target action by applying a predetermined threshold to the generated T-CAM. However, the class-specific T-CAM concentrates on the regions that are most discriminative for the classification task, which ultimately fragments the localization results. In this paper, we propose a complementary learning strategy for weakly-supervised temporal action localization. It obtains erased features by masking the positions with high activation values in the original T-CAM, and uses them as input to train an additional classification network that produces a complementary T-CAM. Finally, the fragmentation problem is alleviated by merging the two T-CAMs. We conducted extensive experiments on THUMOS'14 and ActivityNet1.2, and the results show that the proposed method substantially improves localization performance over existing methods.
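To make the erase-and-merge strategy concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation. The function and argument names (complementary_tcam, classifier_a, classifier_b), the per-class min-max normalization, the erasure threshold of 0.5, and the element-wise maximum used for merging are all illustrative assumptions.

```python
import torch

def complementary_tcam(features, classifier_a, classifier_b, erase_thresh=0.5):
    """Sketch of the complementary learning strategy described above.

    features:     (T, D) snippet-level features of an untrimmed video
    classifier_a: primary classification head producing the original T-CAM
    classifier_b: auxiliary head trained on erased features
    erase_thresh: normalized activation above which snippets are masked
                  (hypothetical choice)
    """
    # Original T-CAM: per-snippet class scores, shape (T, C)
    tcam_a = classifier_a(features)

    # Min-max normalize each class to [0, 1] so one threshold is meaningful
    lo = tcam_a.min(dim=0, keepdim=True).values
    hi = tcam_a.max(dim=0, keepdim=True).values
    norm = (tcam_a - lo) / (hi - lo + 1e-6)

    # Erase features at snippets where any class is highly activated
    high = (norm > erase_thresh).any(dim=1, keepdim=True)  # (T, 1) mask
    erased_features = features.masked_fill(high, 0.0)

    # Complementary T-CAM from the auxiliary classifier on erased input
    tcam_b = classifier_b(erased_features)

    # Merge the two maps; element-wise maximum is one simple choice
    return torch.maximum(tcam_a, tcam_b)
```

Action intervals would then be recovered by thresholding the merged map, as in the standard T-CAM pipeline; because the complementary map responds to the less discriminative regions that the primary map misses, the merged map covers the action more completely and is less prone to fragmentation.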
