Abstract

Existing RGB- and CNN-based methods for video action recognition mostly do not distinguish the human body from the environment and thus easily overfit to the scenes and objects of the training sets. In this work, we present a conceptually simple, general, and high-performance framework for action recognition in videos that aims at person-centric modeling. The method, called Action Machine, operates on person bounding boxes for instance-level action analysis. It extends the Inflated 3D ConvNet (I3D) with a branch for human pose estimation and a 2D CNN for pose-based action recognition. Action Machine benefits from the multi-task training of action recognition and pose estimation, and from the fusion of predictions from RGB images and poses. Experimental results are provided on the trimmed video action datasets NTU RGB+D, Northwestern-UCLA Multiview Action 3D, and MSR Daily Activity 3D. Action Machine achieves superior performance and generalizes well across datasets.
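The fusion of predictions from the RGB (I3D) stream and the pose (2D CNN) stream described above can be sketched as a simple late fusion of per-class scores. This is an illustrative sketch, not the paper's exact procedure: the `fuse_predictions` helper, the equal-weight default, and the toy logits are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over class logits.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_predictions(rgb_logits, pose_logits, w_rgb=0.5):
    """Late-fuse class scores from an RGB stream and a pose stream by
    a weighted average of their softmax probabilities (weights are
    illustrative, not taken from the paper)."""
    p_rgb = softmax(np.asarray(rgb_logits, dtype=float))
    p_pose = softmax(np.asarray(pose_logits, dtype=float))
    fused = w_rgb * p_rgb + (1.0 - w_rgb) * p_pose
    return int(np.argmax(fused)), fused

# Toy example with 4 hypothetical action classes.
pred, scores = fuse_predictions([2.0, 0.5, 0.1, -1.0], [1.0, 2.5, 0.0, 0.2])
```

Averaging probabilities rather than raw logits keeps the two streams on a comparable scale even if their logit magnitudes differ.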
