Abstract

In skeleton-based human action recognition, spatial-temporal graph convolution networks (ST-GCNs) have recently achieved remarkable performance. However, how to extract more discriminative spatial and temporal features remains an open problem. The temporal graph convolution in traditional ST-GCNs uses only one fixed kernel, which cannot cover all the important stages of an action's execution. Besides, the spatial and temporal graph convolution layers (GCLs) are serially connected, which mixes information from different domains and limits the feature extraction capability. In addition, existing methods model input features such as joints, bones, and their motions, but richer input features are needed for better performance. To this end, this article proposes a novel multi-stream and enhanced spatial-temporal graph convolution network (MS-ESTGCN). In each basic block of MS-ESTGCN, densely connected temporal GCLs with different kernel sizes are employed to aggregate more temporal features. To eliminate the adverse impact of information mixing, an additional spatial GCL branch is added to the block so that the spatial features are enhanced. Furthermore, we extend the input features with the relative positions of joints and bones. Consequently, six data modalities in total (joints, bones, and their motions and relative positions) can be fed into the network independently in a six-stream paradigm. The proposed method is evaluated on two large-scale datasets: NTU-RGB+D and Kinetics-Skeleton. The experimental results show that our method using only two data modalities delivers state-of-the-art performance, and our methods using four and six data modalities exceed other methods by a significant margin.

Highlights

  • Human action recognition, which aims to accurately classify human actions [1], plays an essential role in video surveillance, pedestrian tracking, health care systems, virtual reality, and human-computer interaction [2]–[13]

  • To address the aforementioned issues, we propose a novel model named multi-stream and enhanced spatial-temporal graph convolution network (MS-ESTGCN)

  • We propose MS-ESTGCN, which consists of multiple novel blocks in which both spatial and temporal features are enhanced


Summary

INTRODUCTION

Human action recognition, which aims to accurately classify human actions [1], plays an essential role in video surveillance, pedestrian tracking, health care systems, virtual reality, and human-computer interaction [2]–[13]. Skeleton-based action recognition has become an attractive and popular research domain [18]–[23]. Conventional methods in this domain [24]–[26] usually use handcrafted features such as joint angles, distances, and kinematics to model the human body. To address the aforementioned issues, we propose a novel model named multi-stream and enhanced spatial-temporal graph convolution network (MS-ESTGCN). The input features consist of six modalities in total: joints, bones, and their motions and relative positions. This makes our model a six-stream network, and the final result is obtained by summing the softmax scores of all streams. MS-ESTGCN consists of multiple novel blocks in which both spatial and temporal features are enhanced. Our models using four and six modalities further exceed other models by a significant margin.
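The summary above does not spell out how the six modalities are derived from raw joint coordinates. The following is a minimal sketch under common conventions in this line of work: bones as vectors from a joint to its skeletal parent, motions as frame-to-frame differences, and relative positions as offsets from a fixed center joint. The edge list, center-joint index, and the exact definition of "relative position" (here applied to both joints and bones) are assumptions, not taken from the paper.

```python
import numpy as np

def build_modalities(joints, edges, center=0):
    """Derive six input modalities from joint coordinates of shape (T, V, C):
    T frames, V joints, C coordinate channels. `edges` lists (child, parent)
    pairs of the skeleton; `center` is a hypothetical center-joint index."""
    # Bones: vector from each joint to its skeletal parent (root stays zero).
    bones = np.zeros_like(joints)
    for child, parent in edges:
        bones[:, child] = joints[:, child] - joints[:, parent]

    # Motions: frame-to-frame differences, zero-padded at the last frame.
    joint_motion = np.zeros_like(joints)
    joint_motion[:-1] = joints[1:] - joints[:-1]
    bone_motion = np.zeros_like(bones)
    bone_motion[:-1] = bones[1:] - bones[:-1]

    # Relative positions: offsets with respect to the center joint's values.
    joint_rel = joints - joints[:, center:center + 1]
    bone_rel = bones - bones[:, center:center + 1]

    return joints, bones, joint_motion, bone_motion, joint_rel, bone_rel
```

In a six-stream paradigm, each of these tensors would be fed to its own stream of the network and the per-stream softmax scores summed at the end.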

RELATED WORK
METHODS
TEMPORAL GRAPH CONVOLUTION LAYERS
EXPERIMENTS
Findings
CONCLUSION