Abstract

Action recognition based on 3D skeleton sequences has gained considerable attention in recent years. Because Covariance Matrix (CM) features effectively represent the spatial and temporal characteristics of skeleton sequences, combining them with a Long Short-Term Memory (LSTM) network is an effective and reasonable way to improve action recognition accuracy. However, the CM features in existing recognition models are computed from raw data either without normalization or with static normalization. Moreover, a CM feature is calculated from all coordinates in one frame, which treats the coordinates on all three axes identically and neglects the relationships among coordinates on the same axis. In this paper, an end-to-end deep learning framework is proposed that includes a normalization layer that adapts dynamically to the data distribution and the inference procedure. After normalization, three covariance feature sequences, one for each coordinate axis, are produced from sliding windows and fused into one fusion matrix using a convolution layer. Finally, the fusion matrix is fed sequentially into an LSTM network to recognize the skeleton action. The novelty of the proposed framework lies in combining adaptive preprocessing and feature fusion with the LSTM network, improving recognition accuracy by optimizing the quality of the features rather than the network construction. In the experiments, the proposed framework is evaluated on public datasets and on one student action dataset collected from a real classroom. The experimental results demonstrate that the proposed method achieves a significant improvement in accuracy compared to state-of-the-art methods. It can be concluded that the proposed framework not only accurately captures the correlation of joints in the same frame but also effectively expresses the dependencies between sequential frames.
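The per-axis covariance step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `axis_covariance_sequences`, the input layout (frames, joints, 3), and the `window` and `stride` parameters are assumptions made for illustration, and the convolutional fusion and LSTM stages are omitted.

```python
import numpy as np


def axis_covariance_sequences(skeleton, window, stride=1):
    """Compute one joint-by-joint covariance matrix per axis per sliding window.

    skeleton: array-like of shape (T, J, 3) holding T frames, J joints, and
              x/y/z coordinates (an assumed layout, not taken from the paper).
    Returns an array of shape (num_windows, 3, J, J).
    """
    skeleton = np.asarray(skeleton, dtype=float)
    num_frames, _, num_axes = skeleton.shape
    if window < 2 or window > num_frames:
        raise ValueError("window must be between 2 and the number of frames")

    features = []
    for start in range(0, num_frames - window + 1, stride):
        chunk = skeleton[start:start + window]  # (window, J, 3)
        # For each axis, frames are observations and joints are variables,
        # so rowvar=False yields a J x J covariance matrix.
        per_axis = [np.cov(chunk[:, :, axis], rowvar=False)
                    for axis in range(num_axes)]
        features.append(np.stack(per_axis))
    return np.stack(features)
```

Keeping the three axes in separate matrices preserves the within-axis relationships that a single covariance over all coordinates would mix together. A later stage could then fuse the three matrices, for example with a convolution layer as the abstract describes.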

Highlights

  • Human action recognition has a wide range of applications, including video surveillance, human-machine interaction, interactive entertainment and multimedia information retrieval [1]

  • Combining the Covariance Matrix (CM) and the Long Short-Term Memory (LSTM) network is an effective and reasonable approach to enhancing the accuracy of action recognition

  • To dynamically normalize the skeleton data, this paper proposes a dynamic normalization module that normalizes the data adaptively during inference according to the distribution of the measurements of the current skeletal data
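The contrast between static normalization and normalization that follows the current data can be sketched as below. This is a non-learned stand-in and not the paper's normalization module, whose internals are not described in this summary. The function name `normalize_by_sample_statistics`, the (frames, joints, 3) layout, and the `eps` parameter are assumptions.

```python
import numpy as np


def normalize_by_sample_statistics(skeleton, eps=1e-8):
    """Standardize each coordinate axis using statistics of this sequence only.

    skeleton: array-like of shape (T, J, 3), an assumed layout.
    A static scheme would instead reuse a mean and standard deviation fixed
    in advance; here both are recomputed from the sequence being processed.
    """
    skeleton = np.asarray(skeleton, dtype=float)
    # Per-axis mean and standard deviation over all frames and joints.
    mean = skeleton.mean(axis=(0, 1), keepdims=True)
    std = skeleton.std(axis=(0, 1), keepdims=True)
    return (skeleton - mean) / (std + eps)
```

Because the statistics come from the input itself, a sequence recorded at a different distance or scale is mapped to a comparable range at inference time without refitting any fixed constants.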

Summary

INTRODUCTION

Yan: Optimizing Features Quality: Normalized Covariance Fusion Framework for Skeleton Action Recognition.

Human action recognition has a wide range of applications, including video surveillance, human-machine interaction, interactive entertainment and multimedia information retrieval [1]. It is difficult to construct deep LSTM networks to learn high-level features of skeleton data. Although performance can be slightly improved through a deep learning scheme, the space complexity is usually high. For tasks such as action recognition based on skeletal data, some important information may be lost when the data are normalized in the traditional manner [10]. Traditional methods use all coordinates of all joints in one frame to compute the CM feature, which treats the three axes identically and neglects the relationships among coordinates on the same axis. The limitations of the existing methods include the following. First, CM features are computed from raw data without normalization, or with static normalization.

RELATED WORKS
SLIDING WINDOW DESIGN FOR CM FEATURE
EXPERIMENTAL RESULTS
DATASETS
CONCLUSION