Abstract

Weakly supervised temporal action detection aims to localize actions using only video-level labels rather than dense frame-level annotations. To this end, previous methods train a classification network to mine discernible action frames, which are then taken as the detection results. However, such a classification network is known to concentrate on a few locally discernible frames rather than on the entire action instance. As a result, a substantial number of indiscernible action frames go undetected and the detection results are incomplete. To alleviate this issue, we propose a novel method that facilitates the detection of indiscernible frames by learning frame-level affinities. Specifically, we design a network, named Affinity Network, that predicts the affinity between each pair of adjacent frames. These affinities are then used as transition probabilities to propagate local responses to indiscernible frames, which enhances the responses of those frames and makes them easier to detect. To train the network, we propose strategies for synthesizing frame-pair and video-pair training samples, which make it possible to learn frame-level affinities from video-level labels alone. Experimental results on the THUMOS14 and ActivityNet1.2 datasets show that our framework outperforms most previous weakly supervised action detection methods and is even competitive with some fully supervised action detection methods.
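
The propagation step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the update rule, so the code assumes a simple random-walk-with-restart scheme over a chain of frames. The function name `propagate_responses` and the parameters `steps` and `alpha` are hypothetical, introduced only for illustration.

```python
import numpy as np


def propagate_responses(responses, affinities, steps=10, alpha=0.5):
    """Spread frame-level responses to neighbouring frames (illustrative only).

    responses:  shape (T,), per-frame classification responses.
    affinities: shape (T-1,), affinity in [0, 1] between frame t and t+1.
    steps, alpha: hypothetical hyperparameters (iterations, mixing weight).
    """
    responses = np.asarray(responses, dtype=float)
    affinities = np.asarray(affinities, dtype=float)
    num_frames = responses.shape[0]
    if affinities.shape[0] != num_frames - 1:
        raise ValueError("expected one affinity per pair of adjacent frames")

    # Assumed graph: a chain of frames with self-loops, where only
    # adjacent frames are connected, weighted by their affinity.
    weights = np.eye(num_frames)
    idx = np.arange(num_frames - 1)
    weights[idx, idx + 1] = affinities
    weights[idx + 1, idx] = affinities

    # Row-normalise so each row is a transition probability distribution.
    transition = weights / weights.sum(axis=1, keepdims=True)

    # Repeatedly mix propagated responses with the original responses.
    out = responses.copy()
    for _ in range(steps):
        out = alpha * (transition @ out) + (1.0 - alpha) * responses
    return out


if __name__ == "__main__":
    # One discernible frame (index 2); affinity drops to 0 between frames 3 and 4.
    resp = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    aff = [0.9, 0.9, 0.9, 0.0, 0.9]
    print(propagate_responses(resp, aff))
```

In this toy input, frames connected to the discernible frame by high affinities receive a non-zero response, while frames beyond the zero-affinity boundary stay at zero, which mirrors the intuition that responses should spread within an action instance but not across its boundary.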

