Abstract

A fundamental bottleneck in achieving highly discriminative action representations is that local motion and appearance features are usually not semantically aligned. That is, a local feature, such as a motion vector or motion trajectory, carries no attribute indicating which moving body part or manipulated object it is associated with. As a result, most methods fall back on global feature pooling or representation learning, which is often too coarse. Inspired by the recent success of end-to-end (pixel-to-pixel) deep convolutional neural networks (DCNNs), in this paper we first propose a DCNN architecture that maps a human-centric image region onto human body part response maps. Based on these response maps, we propose a second DCNN that learns a semantically aligned feature representation. The prior knowledge that only a few parts are responsible for a given action is also exploited by introducing a group (part) sparseness prior during feature learning. The learned semantically aligned features not only boost the discriminative capability of the action representation, but are also robust to pose variations and occlusions. Finally, an iterative mining method is employed to learn discriminative action primitive detectors. Extensive experiments on action recognition benchmarks demonstrate the superior recognition performance of the proposed framework.
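The abstract does not give the exact form of the group (part) sparseness prior. As a rough illustration only, one common way to encourage "only a few parts are active" is a group-lasso style penalty: the sum of the L2 norms of each part's feature block. The sketch below assumes that form; the function name, the part names, and the feature values are hypothetical and are not taken from the paper.

```python
import math


def group_sparsity_penalty(part_features, weight=1.0):
    """Sum of per-part L2 norms (a group-lasso style penalty).

    Assumption: this is one common form of group sparsity, not
    necessarily the prior used in the paper.
    part_features maps a (hypothetical) part name to its feature block.
    """
    return weight * sum(
        math.sqrt(sum(v * v for v in feats))
        for feats in part_features.values()
    )


# Hypothetical per-part feature blocks for one action sample.
features = {
    "head": [0.0, 0.0],
    "left_arm": [3.0, 4.0],
    "legs": [0.0, 0.0],
}

penalty = group_sparsity_penalty(features)

# Parts whose whole block is zero contribute nothing to the penalty,
# so minimizing it tends to switch off entire parts at once.
active_parts = [
    name for name, feats in features.items() if any(v != 0.0 for v in feats)
]
```

Because the penalty is applied per part rather than per feature dimension, it tends to zero out whole parts, which matches the stated intuition that only a few body parts drive a given action.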
