Abstract

Most existing RGB- and CNN-based methods for video action recognition do not distinguish the human body from the environment and thus easily overfit to the scenes and objects of the training set. In this work, we present a conceptually simple, general, and high-performance framework for action recognition in videos, aimed at person-centric modeling. The method, called Action Machine, operates on person bounding boxes for instance-level action analysis. It extends the Inflated 3D ConvNet (I3D) by adding a branch for human pose estimation and a 2D CNN for pose-based action recognition. Action Machine benefits from multi-task training of action recognition and pose estimation, and from the fusion of predictions from RGB images and poses. Experimental results are reported on trimmed video action datasets: NTU RGB+D, Northwestern-UCLA Multiview Action3D, and MSR Daily Activity3D. Action Machine achieves superior performance and generalizes well across datasets.
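The abstract mentions fusing predictions from the RGB and pose streams. The paper's exact fusion scheme is not specified here; a common choice is late fusion, i.e. a weighted average of per-class probabilities from the two streams. The sketch below illustrates this assumption with NumPy; the function names, weight, and logits are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_predictions(rgb_logits, pose_logits, w_rgb=0.5):
    # Late fusion (an assumption, not necessarily the paper's scheme):
    # weighted average of the two streams' class probabilities.
    return w_rgb * softmax(rgb_logits) + (1.0 - w_rgb) * softmax(pose_logits)

# Hypothetical logits for 3 action classes from each stream.
rgb_logits = np.array([2.0, 0.5, 0.1])
pose_logits = np.array([0.2, 1.8, 0.4])

probs = fuse_predictions(rgb_logits, pose_logits)
predicted_class = int(np.argmax(probs))
```

With equal weights, a class favored by both streams dominates the fused score; the weight `w_rgb` would typically be tuned on a validation set.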
