Abstract

Human action recognition (HAR) using skeleton data is an active research area in computer vision. Three-dimensional skeleton data are widely used for HAR because they yield effective and efficient results. Several models have been developed to learn spatiotemporal parameters from skeleton sequences. However, two critical problems remain: (1) previous skeleton sequences were created by connecting joints in a fixed, static order; (2) earlier methods could not focus effectively on the most informative joints. Specifically, this study aimed to (1) demonstrate the ability of convolutional neural networks (CNNs) to learn spatiotemporal parameters of skeleton sequences from different frames of human action, and (2) combine the features from all frames of different human actions and incorporate the spatial structure information necessary for action recognition, using multi-task learning networks (MTLNs). Executing the proposed model on the NTU RGB+D, SYSU, and SBU Kinect Interaction datasets yielded significant improvements over existing models. We further evaluated our model on noisy estimated poses from subsets of the Kinetics dataset and the UCF101 dataset, where the experimental results also showed significant improvement.

Highlights

  • At present, human action recognition (HAR) is an excellent area for research in computer vision

  • Three-dimensional skeleton data, which trace the paths of human skeleton joints, are less susceptible to brightness variations and robust to changes in camera view [1]

  • Although long short-term memory (LSTM) networks are used to retain information across entire sequences over various periods, they still struggle with long-term temporal dependencies [18]



Introduction

Human action recognition (HAR) is an excellent area for research in computer vision. To investigate the spatial and temporal characteristics of human interaction, studies have applied recurrent neural networks (RNNs) with long short-term memory (LSTM) [13,14] neurons to the joints of skeleton sequences [2,15,16,17]. Although LSTM networks are used to retain information across entire sequences over various periods, they still struggle with long-term temporal dependencies [18]. CNNs [20] are able to capture the long-term temporal structure of a complete video [21]. CNN feature maps can thus efficiently extract long-term temporal patterns from a skeleton sequence. In this view, the entire skeleton sequence, with all its joints, is treated as an image.
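To make the image view of a skeleton sequence concrete, the sketch below encodes a sequence of shape (frames, joints, 3) as an image-like array whose rows are frames, columns are joints, and channels are the normalized x/y/z coordinates, so a standard CNN can consume it. This is a minimal illustration of the general idea, not the paper's exact pipeline; the function name `skeleton_to_image` and the use of 25 joints (the NTU RGB+D joint count) are assumptions for the example.

```python
import numpy as np

def skeleton_to_image(sequence):
    """Encode a skeleton sequence as an image-like uint8 array.

    sequence: (T, J, 3) array of T frames, J joints, (x, y, z) coords.
    Returns a (T, J, 3) uint8 "image": rows = frames, columns = joints,
    channels = coordinates scaled to [0, 255].
    """
    seq = np.asarray(sequence, dtype=np.float64)
    # Normalize each coordinate channel independently so the CNN
    # sees a consistent intensity range regardless of subject scale.
    mins = seq.min(axis=(0, 1), keepdims=True)
    maxs = seq.max(axis=(0, 1), keepdims=True)
    scaled = (seq - mins) / np.maximum(maxs - mins, 1e-8) * 255.0
    return scaled.astype(np.uint8)

# Example: 30 frames of 25 joints (hypothetical random poses).
rng = np.random.default_rng(0)
frames = rng.normal(size=(30, 25, 3))
img = skeleton_to_image(frames)
print(img.shape)  # (30, 25, 3)
```

Once the sequence is in this form, long-term temporal patterns span the vertical axis of the image, which is what lets convolutional filters pick them up across the whole clip.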

