Abstract

Most existing RGB- and CNN-based methods for video action recognition do not distinguish the human body from the environment and thus easily overfit to the scenes and objects of the training set. In this work, we present a conceptually simple, general, and high-performance framework for action recognition in videos, aimed at person-centric modeling. The method, called Action Machine, operates on person bounding boxes for instance-level action analysis. It extends the Inflated 3D ConvNet (I3D) by adding a branch for human pose estimation and a 2D CNN for pose-based action recognition. Action Machine benefits from multi-task training of action recognition and pose estimation, and from the fusion of predictions from RGB images and poses. Experimental results are reported on trimmed video action datasets: NTU RGB+D, Northwestern-UCLA Multiview Action3D, and MSR Daily Activity3D. Action Machine achieves superior performance and generalizes well across datasets.
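The abstract mentions fusing predictions from the RGB and pose streams. The paper's exact fusion scheme is not specified here; a common choice is late fusion, i.e. a weighted average of per-class probabilities from the two streams. The sketch below illustrates this assumption with NumPy; the function names, weight, and logits are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_predictions(rgb_logits, pose_logits, w_rgb=0.5):
    # Late fusion (an assumption, not necessarily the paper's scheme):
    # weighted average of the two streams' class probabilities.
    return w_rgb * softmax(rgb_logits) + (1.0 - w_rgb) * softmax(pose_logits)

# Hypothetical logits for 3 action classes from each stream.
rgb_logits = np.array([2.0, 0.5, 0.1])
pose_logits = np.array([0.2, 1.8, 0.4])

probs = fuse_predictions(rgb_logits, pose_logits)
predicted_class = int(np.argmax(probs))
```

With equal weights, a class favored by both streams dominates the fused score; the weight `w_rgb` would typically be tuned on a validation set.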
