Abstract

Detecting human activities in untrimmed video is a significant yet challenging task. Existing methods usually generate temporal action proposals either by exhaustively searching at multiple preset scales or by combining many short video snippets. However, we argue that localizing action instances should be a process of observation, refinement, and determination: observe the attended temporal window, refine its position and scale, then determine whether a true action region has been accurately found. To this end, we formulate the temporal action localization task as a Markov Decision Process and propose an active temporal action proposal model based on reinforcement learning. Our model learns to localize actions in videos by automatically adjusting the position and span of a temporal window via a sequence of transformations. We train an action/non-action binary classifier to determine whether a temporal window contains an action instance. Validation results on the THUMOS'14 dataset show that our proposed method achieves performance competitive with state-of-the-art methods in both accuracy and efficiency, while using far fewer proposals.
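The transformation sequence described above can be sketched as a minimal MDP step function. This is a hypothetical illustration, not the paper's implementation: the action names and the step factor `alpha` are assumptions chosen for clarity.

```python
# Hypothetical sketch of the temporal-window transformations the abstract
# describes: an agent refines a window (start, end) on a video of `length`
# seconds through a discrete set of actions. The action set and alpha=0.2
# are assumptions, not taken from the paper.

def apply_action(start, end, action, length, alpha=0.2):
    """Return the new (start, end) after one transformation step."""
    span = end - start
    delta = alpha * span
    if action == "shift_left":
        start, end = start - delta, end - delta
    elif action == "shift_right":
        start, end = start + delta, end + delta
    elif action == "expand":
        start, end = start - delta / 2, end + delta / 2
    elif action == "shrink":
        start, end = start + delta / 2, end - delta / 2
    # A "terminate" action would leave the window unchanged; in the paper,
    # an action/non-action classifier then judges the final window.
    start = max(0.0, start)          # clip to the video extent
    end = min(float(length), end)
    return start, end
```

For example, `apply_action(10.0, 20.0, "shift_right", 100)` moves the window to `(12.0, 22.0)`; an episode applies such steps until the agent decides the window tightly covers an action instance.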
