Abstract

Weakly supervised Temporal Action Localization (WTAL) aims to locate action instances and identify their corresponding labels using only video-level annotations. Most current methods rely on a Multi-Instance Learning (MIL) framework to predict the start and end boundaries of each action in a video, but they suffer from incomplete localization and context confusion. We therefore propose Double Branch Synergies with Modal Reinforcement (DBSMR), which uses long-short temporal attention to model contextual relationships and refines segment features to make segment classification more discriminative. Because blurred boundaries between actions and camouflaged backgrounds in complex scenes easily lead to wrong localization, we construct a sparse graph that focuses on effectively representing contextual motion through optical-flow modal learning, further enhancing the representation of the active regions under examination and suppressing interference from the background. Finally, following the idea of "All roads lead to Rome", we design motion-guided loss constraints to balance the long-short temporal module and the graph reinforcement module, so that the two branches converge to almost the same detection goal and agree on a localization result close to the ground truth. The algorithm achieves mAPs of 69.1% and 42.0% on the THUMOS14 and ActivityNet1.2 datasets, respectively, and we verify its effectiveness by comparison with state-of-the-art methods.
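To make the MIL framework mentioned above concrete, the sketch below illustrates a common WTAL baseline pattern (not the paper's DBSMR method): per-segment class scores form a Class Activation Sequence, top-k mean pooling aggregates it into video-level scores for training against video-level labels, and thresholding the sequence at inference time yields (start, end) proposals. All function names, the top-k pooling choice, and the threshold value are illustrative assumptions.

```python
import numpy as np

def video_level_scores(cas, k):
    """Aggregate a Class Activation Sequence (T x C segment scores) into
    video-level class scores by averaging each class's top-k segment
    scores -- a common MIL pooling choice in weakly supervised TAL."""
    T, C = cas.shape
    k = min(k, T)
    topk = np.sort(cas, axis=0)[-k:]   # (k, C): highest k scores per class
    return topk.mean(axis=0)           # (C,): video-level class scores

def localize(cas, class_idx, thresh=0.5):
    """Threshold one class's activation sequence and merge consecutive
    above-threshold segments into (start, end) action proposals."""
    active = cas[:, class_idx] > thresh
    proposals, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t                  # an action span begins
        elif not a and start is not None:
            proposals.append((start, t))
            start = None               # the span ended at segment t
    if start is not None:              # span runs to the end of the video
        proposals.append((start, len(active)))
    return proposals
```

The incomplete-localization problem the abstract highlights is visible in this baseline: top-k pooling only needs the most discriminative segments to classify the video, so thresholding the activation sequence tends to recover fragments of an action rather than its full extent.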
