Abstract

In this paper, we address the problem of spatio-temporal action localization in videos. Current state-of-the-art methods for this challenging task first rely on an object detector to localize actors at the frame level, and then link or track the detections across time. Most of these methods focus on leveraging the temporal context of videos for action detection while overlooking the importance of the object detector itself. Here we demonstrate the importance of the object detector in the action localization pipeline and propose a strong object detector, built on the single shot multibox detector (SSD) framework, for better action localization in videos. Different from SSD, we introduce an anchor refine branch at the end of the backbone network to refine the input anchors, and we add a batch normalization layer both before concatenating the intermediate feature maps at the frame level and after stacking feature maps at the clip level. The proposed strong detector makes two contributions: (1) it reduces the number of target objects missed at the frame level; (2) it generates deformable anchor cuboids for modeling temporally dynamic actions. Extensive experiments on UCF-Sports, J-HMDB and UCF-101 validate our claims: we outperform previous state-of-the-art methods by a large margin in terms of frame-mAP and video-mAP, especially at higher overlap thresholds.
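To make the two architectural changes concrete, the following is a minimal sketch, not the authors' implementation: a hypothetical `AnchorRefineBranch` that predicts per-anchor box offsets from the final backbone feature map, and a `NormalizedConcat` module that batch-normalizes intermediate feature maps before channel-wise concatenation. All module names, channel sizes, and the choice of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AnchorRefineBranch(nn.Module):
    """Hypothetical sketch: predicts offsets (dx, dy, dw, dh) per anchor
    location from the last backbone feature map, so the default SSD
    anchors can be refined before the detection heads use them."""
    def __init__(self, in_channels, num_anchors):
        super().__init__()
        self.offset_conv = nn.Conv2d(in_channels, num_anchors * 4,
                                     kernel_size=3, padding=1)

    def forward(self, feat):
        return self.offset_conv(feat)


class NormalizedConcat(nn.Module):
    """Hypothetical sketch: batch-normalize each intermediate feature map
    before concatenating along the channel dimension, mirroring the
    frame-level fusion described in the abstract."""
    def __init__(self, channel_list):
        super().__init__()
        self.bns = nn.ModuleList(nn.BatchNorm2d(c) for c in channel_list)

    def forward(self, feats):
        # feats: list of tensors sharing the same spatial resolution
        return torch.cat([bn(f) for bn, f in zip(self.bns, feats)], dim=1)


if __name__ == "__main__":
    refine = AnchorRefineBranch(in_channels=512, num_anchors=6)
    offsets = refine(torch.randn(2, 512, 38, 38))    # -> (2, 24, 38, 38)

    fuse = NormalizedConcat([256, 512])
    fused = fuse([torch.randn(2, 256, 38, 38),
                  torch.randn(2, 512, 38, 38)])       # -> (2, 768, 38, 38)
    print(offsets.shape, fused.shape)
```

In this reading, the refined anchors at each frame can then be linked across a clip to form the deformable anchor cuboids mentioned above, with the clip-level batch normalization applied after the per-frame feature maps are stacked.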
