Abstract

The human skeleton joints captured by RGB-D cameras are widely used in action recognition because they provide robust and comprehensive 3D information. At present, most skeleton-based action recognition methods treat all skeletal joints as equally important, both spatially and temporally; however, the contributions of individual joints vary significantly. Hence, a GL-LSTM+Diff model is proposed to improve the recognition of human actions. A global spatial attention (GSA) model assigns different weights to different skeletal joints, providing precise spatial information for recognition. An accumulative learning curve (ALC) model highlights which frames contribute most to the final decision by giving a separate temporal weight to each intermediate accumulated learning result. By integrating the proposed GSA (spatial) and ALC (temporal) models into an LSTM framework that takes the human skeletal joints as input, a global spatio-temporal action recognition framework (GL-LSTM) is constructed. Diff is introduced as a preprocessing step that enhances the dynamics of the features, yielding more distinguishable features for deep learning. Rigorous experiments on the large-scale NTU RGB+D dataset and the smaller, commonly used SBU dataset show that the proposed algorithm outperforms other state-of-the-art methods.
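To make the Diff preprocessing concrete: it amounts to frame-wise differencing of the 3D joint coordinates, so that per-joint displacement vectors rather than absolute poses are fed to the network. The following Python sketch illustrates this idea; the (T, J, 3) input layout and the stride parameter are illustrative assumptions, not the paper's exact specification.

    import numpy as np

    def diff_features(skeleton, stride=1):
        """Frame-difference ("Diff") preprocessing for a skeleton sequence.

        skeleton: array of shape (T, J, 3) -- T frames, J joints, 3D coords.
        stride:   temporal offset between the frames being subtracted
                  (hypothetical parameter; the paper's exact offset may differ).

        Returns an array of shape (T - stride, J, 3) holding per-joint
        displacement vectors, which emphasize motion dynamics over
        absolute pose.
        """
        skeleton = np.asarray(skeleton, dtype=np.float32)
        return skeleton[stride:] - skeleton[:-stride]

    # Example: 60 frames of the 25 NTU RGB+D joints.
    seq = np.random.rand(60, 25, 3).astype(np.float32)
    dyn = diff_features(seq)  # shape (59, 25, 3)

Because static pose components cancel in the subtraction, the resulting features vary more strongly across action classes, which is the stated motivation for using Diff as the network input.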

Highlights

  • Human action recognition has a wide range of applications [1], such as human-computer interaction, video surveillance, health care, and entertainment.

  • The present paper proposes a global spatio-temporal attention model, as shown in Figure 1, which takes all frames of each action as input and obtains the weight of each joint for action recognition.

  • To further understand the effectiveness of the global spatial attention model, i.e., which action types benefit most from it, this paper examines the improvement on the NTU RGB+D dataset and lists the top 10 actions with the largest gains.


Summary

INTRODUCTION

Human action recognition has a wide range of applications [1], such as human-computer interaction, video surveillance, health care, and entertainment. Within an action sequence, the importance of each frame to recognition can differ completely, and the same holds for the effect of each joint on different actions. In response to this problem, the current mainstream practice is to embed an attention model into deep learning. Only after reading the entire action sequence can one reliably determine which moments of the action are more important and which joints carry greater weight in recognition. Inspired by this observation, the present paper proposes a global spatio-temporal attention model, as shown in Figure 1, which takes all frames of each action as input and obtains the weight of each joint for action recognition; a sketch of this idea follows below. Diff is proposed as the basic feature for deep learning and significantly improves recognition performance.
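As a concrete illustration of this "read the whole sequence first" idea, the PyTorch sketch below scores each joint from a temporal summary of the entire sequence and reweights the input accordingly before an LSTM. The scoring network, layer sizes, and temporal mean-pooling are assumptions made for illustration; they are not the authors' exact GSA architecture.

    import torch
    import torch.nn as nn

    class GlobalSpatialAttention(nn.Module):
        """Hypothetical sketch of a global spatial attention (GSA) layer.

        It reads the entire sequence before assigning one weight per joint,
        mirroring the paper's idea that joint importance can only be judged
        after the whole action has been observed. The scoring network and
        layer sizes are illustrative assumptions.
        """

        def __init__(self, num_joints=25, coord_dim=3, hidden=64):
            super().__init__()
            self.score = nn.Sequential(
                nn.Linear(coord_dim, hidden),
                nn.Tanh(),
                nn.Linear(hidden, 1),
            )

        def forward(self, x):
            # x: (batch, T, J, 3) -- the full skeleton sequence.
            # Pool over time so each joint is scored from the whole sequence.
            joint_summary = x.mean(dim=1)                   # (batch, J, 3)
            logits = self.score(joint_summary).squeeze(-1)  # (batch, J)
            weights = torch.softmax(logits, dim=-1)         # one weight per joint
            # Reweight every frame's joints before feeding an LSTM.
            return x * weights[:, None, :, None], weights

    # Usage: weight the joints, flatten, and feed a standard LSTM.
    gsa = GlobalSpatialAttention()
    seq = torch.randn(8, 60, 25, 3)                         # batch of 8 sequences
    weighted, w = gsa(seq)
    lstm = nn.LSTM(input_size=25 * 3, hidden_size=128, batch_first=True)
    out, _ = lstm(weighted.flatten(2))                      # (8, 60, 128)

Because a single weight per joint, learned from the whole action, modulates every frame, the attention here is global rather than per-frame, which matches the paper's distinction from frame-local attention models.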

RELATED WORK
HANDCRAFTED DYNAMIC FEATURE
GLOBAL SPATIAL ATTENTION MODEL
INTEGRATION OF SPATIAL MODEL AND TEMPORAL MODEL
DIFF FOR TEMPORAL DYNAMIC FEATURE
EXPERIMENTAL EVALUATION
Findings
CONCLUSION
