Abstract

In this paper, we address the problem of action detection in unconstrained video clips. Our approach starts from action detection on object proposals at each frame, then aggregates the frame-level detections belonging to the same actor across the whole video, via linking, association, and tracking, to generate action tubes that are spatially compact and temporally continuous. To this end, we first propose a novel two-stream action detection model that fuses features from both appearance and motion cues and can be trained end-to-end. We then formulate the association of action paths as a maximum set coverage problem, with the frame-level detection results serving as priors. An incremental search algorithm obtains all the action proposals in a single pass, which is especially efficient for long videos and videos containing multiple action instances. Finally, a tracking-by-detection scheme further refines the generated action paths. Extensive experiments on three benchmark datasets, UCF-Sports, UCF-101, and J-HMDB, show that the proposed approach advances the state of the art in action detection in terms of both accuracy and proposal quality.
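To make the linking step concrete, the sketch below greedily extends action paths frame by frame, scoring each candidate link by its detection confidence plus the spatial overlap with the path's last box. This is a minimal illustration, not the paper's method: the function names, the `alpha` weight, and the linking score are assumptions, and a simple greedy pass only approximates the maximum-set-coverage association and incremental search the abstract describes.

```python
# Illustrative sketch only: greedily link per-frame action detections
# into temporally continuous action paths. All names and the specific
# linking score (confidence + alpha * overlap) are hypothetical choices.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if inter == 0.0:
        return 0.0
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def link_detections(frames, alpha=1.0):
    """Link detections across frames into action paths.

    frames: list over time; each element is a list of (box, score) pairs
            produced by a frame-level detector.
    Returns a list of paths, each a list of (frame_index, box, score).
    """
    paths = []
    for t, dets in enumerate(frames):
        unmatched = list(dets)
        for path in paths:
            last_t, last_box, _ = path[-1]
            if last_t != t - 1:
                continue  # keep paths temporally continuous
            best, best_link = None, 0.0
            for box, score in unmatched:
                overlap = iou(last_box, box)
                link = score + alpha * overlap  # linking score
                if overlap > 0.0 and link > best_link:
                    best, best_link = (box, score), link
            if best is not None:
                path.append((t, best[0], best[1]))
                unmatched.remove(best)
        # Detections that extend no existing path start new paths.
        for box, score in unmatched:
            paths.append([(t, box, score)])
    return paths
```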
