Abstract

The development of RGB-D sensors, which are widely used to capture human motion, is driving research in skeleton-based human action recognition. In recent work, most attention-based models assign the same weight to all body joints for spatiotemporal feature modeling. However, they overlook the fact that joints contribute differently to a human movement, which makes it difficult to obtain high-performance skeleton representations. Therefore, in this article, we propose a novel sustained attention model that adaptively assigns a corresponding weight to each body joint in order to extract the key skeleton parts from the global input sequence. We design a two-stream network based on RNNs and CNNs and integrate the sustained attention mechanism into each subnetwork, so that both the body joint weights and the input frame weights are learned effectively, resulting in superior performance. In addition, during training the skeleton is randomly transformed to enhance the robustness of the model and reduce overfitting. A group of ablation studies and visualization analyses is conducted to verify the validity and robustness of the proposed model. Extensive experiments on four benchmark datasets, including challenging interaction datasets, demonstrate that our proposed model outperforms recent state-of-the-art works.
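The random skeleton transformation used during training can be illustrated with a small sketch. The abstract does not say which transformation is applied, so the rotation about the vertical axis, the function name `random_rotate_y`, and the `max_angle_deg` parameter are all assumptions made for illustration only.

```python
import math
import random


def random_rotate_y(skeleton, max_angle_deg=30.0, rng=random):
    """Rotate a whole skeleton sequence by one random angle about the y axis.

    skeleton: list of frames; each frame is a list of (x, y, z) joint tuples.
    max_angle_deg: hypothetical bound on the rotation angle (not from the paper).
    rng: any object with a uniform(a, b) method, e.g. random.Random(seed).
    """
    theta = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    c, s = math.cos(theta), math.sin(theta)
    # The same angle is used for every frame so the motion stays consistent.
    return [
        [(c * x + s * z, y, -s * x + c * z) for (x, y, z) in frame]
        for frame in skeleton
    ]
```

Applying a different random angle to each training sample produces additional views of the same action, which is one plausible way such a transformation could reduce overfitting.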

Highlights

  • Human action recognition has been widely applied in the areas of entertainment games, health care, remote video surveillance, smart homes, and educational assistance [1]–[4]

  • Because SA-LSTM has a more complex structure, overfitting may occur during the training process

  • The ideas of our work are as follows: 1) In SA-LSTM, we design two ways of combining the attention mechanism with LSTM: one attention model assigns a weight to each joint point, and another assigns a weight to each input frame; 2) In SA-CNN, we redesign the structure of the convolutional neural network (CNN) by introducing the attention mechanism to assign a weight to each joint point; 3) In both subnetworks, we use a data enrichment method to generate additional samples during training; 4) We fuse the results of both networks with a weighted average algorithm to obtain the final recognition rate
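The weighted-average fusion in item 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the single scalar `lstm_weight`, and its default value of 0.5 are assumptions, and the source does not state how the fusion weights are chosen.

```python
def fuse_scores(lstm_scores, cnn_scores, lstm_weight=0.5):
    """Weighted average of per-class scores from the two streams.

    lstm_scores, cnn_scores: lists of class scores, one entry per action class.
    lstm_weight: hypothetical weight in [0, 1] for the SA-LSTM stream;
                 the SA-CNN stream receives 1 - lstm_weight.
    """
    if len(lstm_scores) != len(cnn_scores):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= lstm_weight <= 1.0:
        raise ValueError("lstm_weight must lie in [0, 1]")
    w = lstm_weight
    return [w * a + (1.0 - w) * b for a, b in zip(lstm_scores, cnn_scores)]


def predict(fused_scores):
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)
```

With equal weights, a class that one stream scores highly and the other scores moderately can still win, which is the usual motivation for fusing complementary streams.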

Summary

INTRODUCTION

Human action recognition has been widely applied in the areas of entertainment games, health care, remote video surveillance, smart homes, and educational assistance [1]–[4]. Yang et al. proposed a novel motion feature descriptor that calculates the differences between body joint points for skeleton-based human action recognition [17]. We propose a sustained attention (SA) mechanism that adaptively assigns a corresponding weight to each body joint point to facilitate better human motion recognition from skeleton-based data, enabling the model to focus on modeling skeleton-specific features. This method avoids hand-crafted design of the skeleton representation and reduces manual effort.
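The idea of assigning an adaptive weight to each joint can be sketched in a few lines. This is only an illustration under stated assumptions: the per-joint `scores` are hypothetical inputs (in a trained model they would come from learned parameters, which the source does not describe here), and normalizing them with a softmax is a common choice for attention rather than a detail confirmed by the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_joint_attention(frame, scores):
    """Scale each joint of one frame by its attention weight.

    frame: list of (x, y, z) joint tuples.
    scores: one hypothetical relevance score per joint (assumed given).
    Returns the weighted joints and the weights themselves.
    """
    if len(frame) != len(scores):
        raise ValueError("need exactly one score per joint")
    weights = softmax(scores)
    weighted = [tuple(w * v for v in joint) for joint, w in zip(frame, weights)]
    return weighted, weights
```

Joints with higher scores keep more of their signal, while low-scoring joints are attenuated, so downstream layers see the informative body parts more strongly.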

RELATED WORK
TWO-STREAM FUSION
MODEL TRAINING AND DATA ENRICHMENT
EXPERIMENTS AND DISCUSSION
ABLATION STUDY
Methods
Findings
CONCLUSION AND FUTURE WORK