Abstract

In recent years, weakly-supervised temporal action localization (WS-TAL), which aims to localize action frames in untrimmed videos using only video-level annotations, has gained increasing attention. Most existing WS-TAL methods rely heavily on the features learned for action localization. It is therefore important to improve the ability to separate the frames of action instances from background frames. To address this challenge, this paper introduces a framework that learns two extra constraints, Action-Background Learning and Action-Foreground Learning. The former maximizes the discrepancy between action and background features, while the latter avoids misjudging action instances. We evaluate the proposed model on two benchmark datasets, and the experimental results show that the method achieves performance comparable to current state-of-the-art WS-TAL methods.
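As an illustration of the kind of action-background separation constraint the abstract describes, the following is a minimal sketch of a discrepancy-maximizing loss. The function name, the use of frame-level attention weights, and the cosine-based loss form are all assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def separation_loss(features, attention):
    """Hypothetical action-background separation constraint (not the paper's exact loss).

    features : (T, D) array of per-frame features
    attention: (T,) array of foreground attention scores in [0, 1]

    Pools an action feature and a background feature with the attention
    weights, then returns a loss that is minimized when the two pooled
    features point in opposite directions (maximal cosine discrepancy).
    """
    eps = 1e-8
    fg_w = attention / (attention.sum() + eps)          # foreground pooling weights
    bg = 1.0 - attention
    bg_w = bg / (bg.sum() + eps)                        # background pooling weights
    f_act = (fg_w[:, None] * features).sum(axis=0)      # pooled action feature
    f_bg = (bg_w[:, None] * features).sum(axis=0)       # pooled background feature
    cos = f_act @ f_bg / (np.linalg.norm(f_act) * np.linalg.norm(f_bg) + eps)
    return 1.0 + cos  # minimizing drives cosine toward -1, i.e. maximal discrepancy
```

In this sketch the loss is near 0 when action and background features are opposed and near 2 when they coincide, so adding it to a classification objective would push the model to make the two feature groups more separable.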
