Abstract

Spatial-temporal graph convolutional networks (ST-GCNs) have recently made great progress in skeleton-based human action recognition. However, they use only one fixed temporal convolution kernel, which is not enough to extract temporal cues comprehensively. Moreover, simply connecting the spatial graph convolution layer (GCL) and the temporal GCL in series is not the optimal solution. To address both issues, we propose a novel enhanced spatial and extended temporal graph convolutional network (EE-GCN). Three convolution kernels of different sizes extract discriminative temporal features from shorter to longer terms. The corresponding GCLs are then concatenated by a powerful yet efficient one-shot aggregation (OSA) + effective squeeze-excitation (eSE) structure. The OSA module aggregates the features from each layer once to the output, and the eSE module explores the interdependency between the channels of that output. In addition, we propose a new connection paradigm to enhance the spatial features, which expands the serial connection into a combination of serial and parallel connections by adding a spatial GCL in parallel with the temporal GCLs. The proposed method is evaluated on three large-scale datasets, and the experimental results show that it outperforms previous state-of-the-art methods.
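The temporal part of the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The kernel sizes (3, 5, 7), the function names `temporal_conv` and `osa_ese`, and the use of a moving average in place of learned convolution weights are all assumptions made for illustration. The gate is also simplified: it omits the fully connected layer that an eSE block would normally place between pooling and the sigmoid.

```python
import math

def temporal_conv(seq, kernel_size):
    """Zero-padded moving average over time.

    Stand-in (assumption) for a learned temporal kernel of the given size.
    """
    pad = kernel_size // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(padded[t:t + kernel_size]) / kernel_size for t in range(len(seq))]

def osa_ese(seq, kernel_sizes=(3, 5, 7)):
    """Chain temporal layers, aggregate their outputs once, then gate each channel."""
    features = []
    x = list(seq)
    for k in kernel_sizes:
        x = temporal_conv(x, k)   # each layer feeds the next one
        features.append(x)        # and is kept for the single aggregation step

    # One-shot aggregation: every layer's output becomes one channel, shape [C][T].
    # Simplified eSE-style gate: global average pool -> sigmoid -> rescale channel.
    gated = []
    for channel in features:
        pooled = sum(channel) / len(channel)
        gate = 1.0 / (1.0 + math.exp(-pooled))
        gated.append([gate * v for v in channel])
    return gated

# Hypothetical input: one joint coordinate tracked over 10 frames.
channels = osa_ese([1.0] * 10)
print(len(channels), len(channels[0]))
```

Each output channel corresponds to one temporal scale, so a later layer can weigh short-term against long-term motion cues.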

Highlights

  • Human action recognition has many application scenarios in the real world, such as security surveillance, health care systems, autonomous driving and human-computer interaction [1,2,3,4,5]

  • Our work focuses on the task of skeleton-based action recognition

  • To address the above two problems, we propose a novel model, namely the enhanced spatial and extended temporal graph convolutional network (EE-GCN)


Summary

Introduction

Human action recognition has many application scenarios in the real world, such as security surveillance, health care systems, autonomous driving and human-computer interaction [1,2,3,4,5]. Convolutional neural networks (CNNs) have achieved much better performance than hand-crafted methods [20,21,22]. However, whether deep models treat the skeleton data as a sequence of vectors, as RNNs do, or as 2D pseudo images, as CNNs do, they all neglect that the human skeleton is naturally a non-Euclidean graph structure composed of vertices and edges. Based on this observation, Yan et al. [23] proposed a spatial-temporal graph convolutional network (ST-GCN) that represents human joints as vertices and bones as edges. By combining serial and parallel connections between the spatial and temporal GCLs, our model captures complex spatial-temporal features more effectively than a purely serial connection.
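The graph view of the skeleton can be made concrete with a small sketch. The five-joint skeleton, its joint indices and the helper names `build_adjacency` and `spatial_aggregate` are hypothetical. Row normalisation is used here as a simplification and should not be read as the exact normalisation or partition strategy of ST-GCN or EE-GCN.

```python
def build_adjacency(num_joints, bones):
    """Symmetric adjacency with self-loops: joints are vertices, bones are edges."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]
    for a, b in bones:
        adj[a][b] = 1.0
        adj[b][a] = 1.0
    return adj

def spatial_aggregate(adj, joint_features):
    """Average each joint's feature with those of its neighbours (row-normalised)."""
    out = []
    for row in adj:
        degree = sum(row)
        out.append(sum(w * f for w, f in zip(row, joint_features)) / degree)
    return out

# Toy skeleton (hypothetical indices): 0 head, 1 torso, 2 left hand, 3 right hand, 4 foot.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]
adj = build_adjacency(5, BONES)

# A feature present only at the head spreads to the torso after one aggregation step.
smoothed = spatial_aggregate(adj, [1.0, 0.0, 0.0, 0.0, 0.0])
print(smoothed)
```

A learned spatial GCL would additionally multiply the aggregated features by a trainable weight matrix. The sketch only shows how the bone structure decides which joints exchange information.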

Related Work
Non-GCN-Based Methods
Improvements of GCN-Based Methods in the Spatial Domain
Improvements of GCN-Based Methods in the Temporal Domain
Network Architecture
Spatial
OSA Module
Experiments
Datasets
Training Details
Ablation Study
Temporal
Methods
Enhanced Spatial and Extended Temporal Graph Convolution
Comparison to Other State-of-the-Art Methods
Findings
Conclusions
