Abstract

Human pose has proved to be an effective representation for action recognition in video. However, traditional 2D pose features extracted from videos suffer from large variations caused by viewpoint changes and projection. In this paper, we investigate recent monocular 3D pose estimation techniques for action recognition and perform a cross-modality analysis, comparing 2D, monocular 3D, and Kinect 3D poses in terms of action recognition performance, especially under cross-viewpoint settings. We show that our proposed monocular 3D pose action recognition pipeline achieves superior results even without real depth information as input. Our proposed three-stream fusion of 3D pose, motion, and appearance outperforms state-of-the-art methods on the sub-JHMDB, Penn Action, and NTU RGB+D datasets.
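
The abstract does not specify how the three streams are combined; a minimal sketch, assuming late (score-level) fusion of three independently trained classifiers whose softmax scores are averaged with fixed weights, is given below. The module names, feature dimensions, and equal fusion weights are illustrative assumptions, not the paper's published architecture.

import torch
import torch.nn as nn

class ThreeStreamFusion(nn.Module):
    # Late (score-level) fusion of pose, motion, and appearance streams.
    # Each stream is any classifier mapping its modality to per-class logits;
    # the fused prediction is a weighted average of per-stream softmax scores.
    # (Illustrative assumption: the paper may fuse its streams differently.)
    def __init__(self, pose_net, motion_net, appearance_net,
                 weights=(1.0, 1.0, 1.0)):
        super().__init__()
        self.pose_net = pose_net
        self.motion_net = motion_net
        self.appearance_net = appearance_net
        self.register_buffer("weights", torch.tensor(weights))

    def forward(self, pose, motion, rgb):
        # Per-stream class probabilities, stacked as (stream, batch, class).
        scores = torch.stack([
            torch.softmax(self.pose_net(pose), dim=-1),
            torch.softmax(self.motion_net(motion), dim=-1),
            torch.softmax(self.appearance_net(rgb), dim=-1),
        ])
        w = self.weights / self.weights.sum()  # normalize fusion weights
        return torch.einsum("s,sbc->bc", w, scores)

# Illustrative usage with placeholder linear classifiers; the feature
# dimensions are hypothetical (e.g., 15 joints x 3 coordinates for pose).
n_classes = 12  # e.g., sub-JHMDB has 12 action categories
fusion = ThreeStreamFusion(
    nn.Linear(45, n_classes),   # 3D pose stream
    nn.Linear(128, n_classes),  # motion stream
    nn.Linear(256, n_classes),  # appearance stream
)
probs = fusion(torch.randn(2, 45), torch.randn(2, 128), torch.randn(2, 256))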
