Abstract
Human skeleton joints captured by RGB-D cameras are widely used in action recognition because they provide robust and comprehensive 3D information. At present, most skeleton-based action recognition methods treat all skeletal joints as equally important, both spatially and temporally. However, the contributions of individual joints vary significantly. Hence, a GL-LSTM+Diff model is proposed to improve the recognition of human actions. A global spatial attention (GSA) model assigns different weights to different skeletal joints, providing precise spatial information for action recognition. An accumulative learning curve (ALC) model highlights which frames contribute most to the final decision by assigning varying temporal weights to the intermediate accumulated learning results. By integrating the proposed GSA (spatial) and ALC (temporal) models into an LSTM framework that takes human skeletal joints as input, a global spatio-temporal action recognition framework (GL-LSTM) is constructed. Diff is introduced as a preprocessing method that enhances the dynamics of the features, yielding more distinguishable features for deep learning. Rigorous experiments on the large-scale NTU RGB+D dataset and the commonly used small SBU dataset show that the proposed algorithm outperforms other state-of-the-art methods.
Highlights
Human action recognition has a wide range of applications [1], such as human-computer interaction, video surveillance, health care, and entertainment
The present paper proposes a global spatio-temporal attention model, as shown in Figure 1, which takes all frames of each action as input and obtains the weight of each joint for action recognition
To further examine the effectiveness of the global spatial attention model (that is, which action types it benefits most), this paper measures the improvement on the NTU RGB+D dataset and lists the top 10 actions with the largest gains
Summary
Human action recognition has a wide range of applications [1], such as human-computer interaction, video surveillance, health care, and entertainment. In a sequence of actions, each frame may differ completely in its importance for recognizing the action, and likewise each joint contributes differently to different actions. In response to this problem, the mainstream practice at present is to embed an attention model into deep learning. Only after reading the entire action sequence in a complete way can one reliably determine which moments of the action are more important and which joints carry greater weight in recognition. Inspired by this observation, the present paper proposes a global spatio-temporal attention model, which takes all frames of each action as input and obtains the weight of each joint for action recognition. Diff is proposed as the basic feature for deep learning, which significantly improves action recognition performance
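As an illustration of the two ideas above, here is a minimal sketch (not the authors' implementation; the function names, the fixed attention scores, and the use of a plain softmax over joints are assumptions) of Diff-style frame differencing followed by per-joint attention weighting:

```python
import numpy as np

def diff_features(seq):
    """Frame-to-frame differences of skeleton joint coordinates.

    seq: array of shape (T, J, 3) -- T frames, J joints, 3D coordinates.
    Returns shape (T-1, J, 3), emphasizing motion dynamics over static pose.
    """
    return seq[1:] - seq[:-1]

def spatial_attention(features, scores):
    """Weight each joint by a softmax-normalized score.

    features: (T, J, D); scores: (J,) raw per-joint scores. In the paper
    these weights are learned; here they are fixed for illustration.
    """
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()              # softmax over joints
    return features * alpha[None, :, None]   # broadcast over frames and dims

# Toy sequence: 4 frames, 3 joints, 3D coordinates
seq = np.arange(4 * 3 * 3, dtype=float).reshape(4, 3, 3)
d = diff_features(seq)                               # shape (3, 3, 3)
weighted = spatial_attention(d, np.array([0.1, 2.0, 0.5]))
```

In a full pipeline, `weighted` would be flattened per frame and fed to the LSTM, with the temporal (ALC-style) weighting applied to the per-frame outputs before the final classification.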