Abstract

We present a deep learning-based multitask framework for joint 3D human pose estimation and action recognition from simple RGB cameras. The approach proceeds in two stages. In the first, a real-time 2D pose detector locates the pixel coordinates of the key body joints. A two-stream deep neural network is then designed and trained to map the detected 2D keypoints into 3D poses. In the second stage, the Efficient Neural Architecture Search (ENAS) algorithm is deployed to find an optimal network architecture for modeling the spatio-temporal evolution of the estimated 3D poses via an image-based intermediate representation and for performing action recognition. Experiments on the Human3.6M, MSR Action3D, and SBU Kinect Interaction datasets verify the effectiveness of the proposed method on the targeted tasks. Moreover, the method requires a low computational budget for training and inference. In particular, the experimental results show that with a monocular RGB sensor we can build a 3D pose estimation and human action recognition approach that matches the performance of RGB-depth sensors. This opens up many opportunities for leveraging RGB cameras (which are much cheaper than depth cameras and extensively deployed in private and public places) to build intelligent recognition systems.
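To make the two-stage pipeline concrete, the sketch below shows one plausible PyTorch realization. It is an illustrative assumption, not the authors' implementation: the joint count, layer sizes, the confidence-score second stream, the joints-by-frames image encoding, and the stand-in CNN classifier (in place of an ENAS-discovered architecture) are all hypothetical.

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract.
# Names, layer sizes, and the encoding scheme are illustrative assumptions.
import torch
import torch.nn as nn


class TwoStreamLifter(nn.Module):
    """Stage 1 (second step): lift detected 2D keypoints to 3D poses.

    Assumes 17 body joints; one stream sees (x, y) coordinates, the other
    a confidence score per joint, and the merged features regress 3D joints.
    """

    def __init__(self, num_joints: int = 17, hidden: int = 256):
        super().__init__()
        self.coord_stream = nn.Sequential(
            nn.Linear(num_joints * 2, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.conf_stream = nn.Sequential(
            nn.Linear(num_joints, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.head = nn.Linear(2 * hidden, num_joints * 3)

    def forward(self, kp2d: torch.Tensor, conf: torch.Tensor) -> torch.Tensor:
        # kp2d: (B, J, 2), conf: (B, J) -> returns (B, J, 3)
        b = kp2d.shape[0]
        feats = torch.cat(
            [self.coord_stream(kp2d.flatten(1)), self.conf_stream(conf)], dim=1
        )
        return self.head(feats).view(b, -1, 3)


def poses_to_image(pose_seq: torch.Tensor) -> torch.Tensor:
    """Stage 2 input: encode a 3D pose sequence as an image-like tensor.

    Treats joints as rows, frames as columns, and the (x, y, z) coordinates
    as three channels normalized to [0, 1] -- one common encoding; the
    paper's exact intermediate representation may differ.
    """
    # pose_seq: (T, J, 3) -> (3, J, T)
    img = pose_seq.permute(2, 1, 0)
    lo, hi = img.amin(), img.amax()
    return (img - lo) / (hi - lo + 1e-8)


# The action classifier would be an ENAS-discovered CNN; this fixed
# stand-in only shows the expected input/output contract.
classifier = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 20),  # e.g. 20 actions
)

if __name__ == "__main__":
    lifter = TwoStreamLifter()
    kp2d, conf = torch.rand(8, 17, 2), torch.rand(8, 17)   # 8 frames of a clip
    pose3d = lifter(kp2d, conf)                             # (8, 17, 3)
    logits = classifier(poses_to_image(pose3d).unsqueeze(0))
    print(pose3d.shape, logits.shape)                       # sanity check
```

In this reading, the lifter is applied per frame and the resulting 3D pose sequence is rendered into a single image-like tensor, so the temporal evolution of the action is captured by an ordinary 2D CNN over the joints-by-frames grid.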

Highlights

  • Human Action Recognition (HAR) from videos has been researched for decades, as it plays a key role in areas such as intelligent surveillance, human–robot interaction, and robot vision

  • Depth sensors have some significant drawbacks with respect to 3D pose estimation

  • The proposed method achieves state-of-the-art results on the 3D human pose estimation task and benefits action recognition


Summary

Introduction

Human Action Recognition (HAR) from videos has been researched for decades, since this topic plays a key role in various areas such as intelligent surveillance, human–robot interaction, and robot vision. Traditional approaches to video-based action recognition [1] have focused on extracting hand-crafted local features and building motion descriptors from RGB sensors. A major limitation of these approaches is the lack of 3D structure of the scene: recognizing human actions from RGB information alone is not enough to overcome the current challenges in the field. Depth sensors supply this missing 3D information, and most current depth sensors have integrated real-time skeleton estimation and tracking frameworks [5], facilitating the collection of skeletal data. The skeleton is a high-level representation of the human body that is well suited to the problem of motion analysis. However, a major drawback of low-cost depth sensors is their inability to work in bright light, especially sunlight [11].
