Abstract

In skeleton-based human action recognition methods, human behaviours can be analysed through temporal and spatial changes in the human skeleton. Skeletons are not affected by clothing changes, lighting conditions, or complex backgrounds, so this recognition approach is robust and has attracted great interest. However, many existing studies use deep networks with large numbers of parameters to improve model performance, thereby losing the low computational cost that is the main advantage of skeleton data; such models are difficult to deploy in real-life applications based on low-cost embedded devices. To obtain a model with fewer parameters and higher accuracy, this study designed a lightweight frame-level joints adaptive graph convolutional network (FLAGCN) for skeleton-based action recognition. Compared with the classical 2s-AGCN model, the new model obtains higher precision with 1/8 of the parameters and 1/9 of the floating-point operations (FLOPs). The proposed network features three main improvements. First, an early feature-fusion method replaces the multistream network and reduces the number of required parameters. Second, at the spatial level, two kinds of graph convolution capture different aspects of human action information: a frame-level graph convolution constructs a human topological structure for each data frame, whereas an adjacency graph convolution captures the characteristics of adjacent joints. Third, the proposed model hierarchically extracts different levels of action-sequence features, making it clear and easy to understand while reducing its depth and parameter count. Extensive experiments on the NTU RGB+D 60 and 120 data sets show that this method offers few parameters, low computational costs, and fast speeds. It also has a simple structure and training process that make it easy to deploy in real-time recognition systems based on low-cost embedded devices.
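As a rough illustration of the two spatial aggregation schemes named in the abstract, the sketch below (NumPy; joint count, channel sizes, and weight shapes are hypothetical, and the bone list is truncated for brevity) contrasts a fixed adjacency-based graph convolution with a frame-level adaptive one whose adjacency is computed per frame from the joint features themselves. This is a minimal sketch, not the paper's actual layer definition.

```python
import numpy as np

rng = np.random.default_rng(0)
V, C = 25, 3                      # 25 joints (NTU skeleton), 3D coordinates
X = rng.standard_normal((V, C))   # joint features for one frame

# Adjacency graph convolution: aggregate each joint's physical neighbours
# through a fixed, normalised skeleton adjacency matrix.
A = np.eye(V)                     # self-loops; full bone list omitted
A[0, 1] = A[1, 0] = 1             # one example bone between joints 0 and 1
D_inv = np.diag(1.0 / A.sum(axis=1))
W = rng.standard_normal((C, 8))   # learnable weights (8 output channels, hypothetical)
out_fixed = D_inv @ A @ X @ W     # shape (V, 8)

# Frame-level adaptive graph convolution: build a data-dependent adjacency
# from pairwise joint similarity in this frame (softmax-normalised rows),
# so the topology can differ from frame to frame.
S = X @ X.T                                           # (V, V) similarity
A_adapt = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
out_adapt = A_adapt @ X @ W                           # shape (V, 8)

print(out_fixed.shape, out_adapt.shape)
```

In a real network both branches would be learned end to end and applied to every frame of the sequence; here the weights are random placeholders.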

Highlights

  • Human action recognition can be used in various scenes, such as video retrieval and human-computer interaction [1], so it has been widely discussed in the literature

  • To solve the problems described above, this study proposed a lightweight hierarchical model called a frame-level joints adaptive graph convolutional network (FLAGCN)

  • In traditional skeleton-based human action recognition methods, the skeleton is treated as structured data similar to an image, and the spatial relationships between joints are ignored. The spatiotemporal graph convolutional network (ST-GCN) introduced graph convolution and defined a spatiotemporal skeleton sequence composed of nodes and edges, where nodes refer to the joints in the skeleton and edges are divided into two categories
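The spatiotemporal graph with two edge categories mentioned in the last highlight can be illustrated with a toy example. The sketch below (NumPy; the joint count, frame count, and bone list are made up for illustration) builds one node per (joint, frame) pair and adds spatial edges (bones within each frame) and temporal edges (the same joint in consecutive frames).

```python
import numpy as np

V, T = 5, 3                                # toy skeleton: 5 joints, 3 frames
bones = [(0, 1), (1, 2), (1, 3), (3, 4)]   # hypothetical bone list

N = V * T                                  # one graph node per (joint, frame)
A = np.zeros((N, N))

def node(v, t):
    """Index of joint v at frame t in the flattened spatiotemporal graph."""
    return t * V + v

# Edge category 1 (spatial): bones connecting joints inside each frame.
for t in range(T):
    for i, j in bones:
        A[node(i, t), node(j, t)] = A[node(j, t), node(i, t)] = 1

# Edge category 2 (temporal): the same joint in consecutive frames.
for t in range(T - 1):
    for v in range(V):
        A[node(v, t), node(v, t + 1)] = A[node(v, t + 1), node(v, t)] = 1

spatial_edges = len(bones) * T       # 4 bones x 3 frames = 12
temporal_edges = V * (T - 1)         # 5 joints x 2 transitions = 10
print(int(A.sum()) // 2, spatial_edges + temporal_edges)
```

Graph convolution on this combined adjacency then propagates information both within a frame and across time, which is the core idea ST-GCN introduced.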

Introduction

Human action recognition can be used in various scenes, such as video retrieval and human-computer interaction [1], so it has been widely discussed in the literature. Due to the limitations of data sets, skeleton-based human action recognition researchers mainly used manual feature-extraction and machine-learning methods. Yan et al. first applied a graph convolution method in a skeleton-based human action recognition study [19] and proposed the spatiotemporal graph convolutional network (ST-GCN). Subsequent methods make the network deeper and the structure of each layer more complex; they often introduce many parameters and extremely difficult training processes and frequently require many computing resources and long training times. These methods place high demands on the computing performance of the utilised equipment and take a long time to predict action sequences in practical applications. In the model proposed in this study, the three-dimensional coordinate features of joints are mainly extracted at the point level, the spatial features of all joints in each frame are extracted at the spatial level, and the temporal features of the whole sequence are extracted at the temporal level. Therefore, the model is simple, clear, and easy to understand. The ablation experiment confirms that the layered feature-extraction process utilised in this model can effectively improve recognition accuracy with a small number of required parameters.
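The hierarchical point/spatial/temporal extraction described above can be sketched as a simple pipeline. The code below (NumPy; all shapes, the pooling choices, and the random weights are illustrative assumptions, not the paper's layers) embeds each joint independently, then aggregates over joints within a frame, then over frames.

```python
import numpy as np

rng = np.random.default_rng(1)
T, V, C = 64, 25, 3                      # frames, joints, 3D coordinates (hypothetical)
seq = rng.standard_normal((T, V, C))     # one skeleton action sequence

# Point level: embed each joint's 3D coordinates independently.
W_point = rng.standard_normal((C, 16))
point_feat = np.maximum(seq @ W_point, 0)   # (T, V, 16), ReLU

# Spatial level: aggregate over all joints in each frame.
frame_feat = point_feat.mean(axis=1)        # (T, 16)

# Temporal level: summarise the whole sequence, then classify.
seq_feat = frame_feat.max(axis=0)           # (16,)
W_cls = rng.standard_normal((16, 60))       # 60 classes, as in NTU RGB+D 60
logits = seq_feat @ W_cls                   # (60,)
print(logits.shape)
```

Each stage only sees the output of the previous one, which is what keeps the hierarchy easy to follow and the parameter count small.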

Related Work
Methodologies
Point Level
Spatial Level
Coordinates embedding
Features fusion
Experiment
Ablation Study
Findings
Method
