Abstract

Pose-based action recognition is a long-standing research topic in computer vision. However, most existing pose-based methods are built on human skeleton data alone and therefore cannot exploit features of the motion-related object, which is a crucial cue for discriminating human actions. To address this issue, we propose a novel pose-flow relational model that benefits from both pose dynamics and optical flow. First, we introduce a pose estimation module to extract the skeleton data of the key person from the raw video. Second, we propose a hierarchical pose-based network to explore the rich spatial–temporal features of human skeleton positions. Third, we embed an inflated 3D network to capture the subtle cues of the motion-related object from optical flow. We evaluate our model on four popular action recognition benchmarks (HMDB-51, JHMDB, sub-JHMDB, and SYSU 3D). Experimental results demonstrate that the proposed model outperforms existing pose-based methods in human action recognition.

Highlights

  • In the past few years, human action recognition in videos has gained increasing attention for its wide range of applications in smart surveillance systems, human–computer interaction, and motion analysis [1–4]. Practical applications still face challenges such as interference from complex backgrounds, body occlusion due to camera position, and pattern blurring due to fast motion.

  • Compared with appearance-based methods, which use a red–green–blue (RGB) image or optical flow as the model input, pose-based action recognition methods [10] can effectively capture rich spatial–temporal cues of human actions without interference from irrelevant information. Progress in this direction is mainly attributed to the emergence of deep neural networks, e.g., convolutional neural networks (CNNs), recurrent neural networks (RNNs), and 3D convolutional neural networks.

  • Optical flow is usually represented as a color image in the Hue-Saturation-Value (HSV) color space, while human pose is represented by the positions of human skeleton joints.
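To make the HSV representation of optical flow concrete, here is a minimal sketch in plain Python. It assumes a common visualization convention that the source does not spell out: hue encodes flow direction, value encodes flow magnitude, and saturation is fixed at 1. The function name `flow_to_rgb` and the `max_magnitude` parameter are hypothetical and are not taken from the paper.

```python
import colorsys
import math


def flow_to_rgb(flow, max_magnitude=None):
    """Map a 2D grid of (dx, dy) flow vectors to 8-bit RGB triples via HSV.

    Assumed convention (not confirmed by the source): hue = direction,
    value = normalized magnitude, saturation = 1.
    """
    magnitudes = [[math.hypot(dx, dy) for dx, dy in row] for row in flow]
    if max_magnitude is None:
        # Normalize by the largest magnitude; fall back to 1.0 for an all-zero field.
        max_magnitude = max(max(row) for row in magnitudes) or 1.0

    image = []
    for row, mag_row in zip(flow, magnitudes):
        out_row = []
        for (dx, dy), mag in zip(row, mag_row):
            # Direction in [0, 2*pi) mapped to a hue in [0, 1).
            hue = (math.atan2(dy, dx) % (2 * math.pi)) / (2 * math.pi)
            value = min(mag / max_magnitude, 1.0)
            r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
            out_row.append((round(r * 255), round(g * 255), round(b * 255)))
        image.append(out_row)
    return image


# One row of three pixels: rightward motion, no motion, leftward motion.
example = flow_to_rgb([[(1.0, 0.0), (0.0, 0.0), (-1.0, 0.0)]])
print(example[0])  # [(255, 0, 0), (0, 0, 0), (0, 255, 255)]
```

Under this convention a static pixel is black and opposite motion directions receive complementary hues, which is one reason a flow field rendered this way can be fed to an image-style network.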

Summary

INTRODUCTION

In the past few years, human action recognition in videos has gained increasing attention for its wide range of applications in smart surveillance systems, human–computer interaction, and motion analysis. Practical applications still face challenges such as interference from complex backgrounds, body occlusion due to camera position, and pattern blurring due to fast motion. Compared with appearance-based methods, which use a red–green–blue (RGB) image or optical flow as the model input, pose-based action recognition methods can effectively capture rich spatial–temporal cues of human actions without interference from irrelevant information. Progress in this direction is mainly attributed to the emergence of deep neural networks, e.g., convolutional neural networks (CNNs), recurrent neural networks (RNNs), and 3D convolutional neural networks. A hierarchical pose-based network is proposed to explore the rich spatial–temporal features of human skeleton data. Experimental results show that the proposed model outperforms existing pose-based methods in human action recognition.
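As a rough illustration of what "pose dynamics" can mean at the data level, the sketch below stores a skeleton sequence as a list of frames, each a list of (x, y) joint positions, and computes per-joint displacements between consecutive frames. This is only an assumed input representation; it is not the hierarchical network described in the paper, and the name `joint_displacements` is hypothetical.

```python
def joint_displacements(sequence):
    """Return per-joint (dx, dy) displacements between consecutive frames.

    sequence: list of frames; each frame is a list of (x, y) joint positions.
    The result has one entry fewer than the number of frames.
    """
    if any(len(frame) != len(sequence[0]) for frame in sequence):
        raise ValueError("every frame must contain the same number of joints")
    return [
        [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(prev, curr)]
        for prev, curr in zip(sequence, sequence[1:])
    ]


# Three frames, two joints each (pixel coordinates, made up for illustration).
frames = [
    [(0, 0), (10, 10)],
    [(1, 2), (10, 9)],
    [(3, 2), (10, 9)],
]
print(joint_displacements(frames))
# [[(1, 2), (0, -1)], [(2, 0), (0, 0)]]
```

Positions describe where each joint is in a frame, while displacements describe how it moves between frames; a spatial–temporal model can consume either or both.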

Human pose estimation
Pose-based action recognition in video
Feature aggregation
Overview
Pose estimation module
Pose-flow relational model
Feature aggregation for action recognition
EXPERIMENTS
Datasets
Implementation details
Ablation studies
Method
Comparison on class accuracies
Comparison with the state-of-the-art methods
Findings
CONCLUSION
