Abstract
In the skeleton-based human action recognition domain, spatial-temporal graph convolutional networks (ST-GCNs) have made great progress recently. However, they use only one fixed temporal convolution kernel, which is not enough to extract temporal cues comprehensively. Moreover, simply connecting the spatial graph convolution layer (GCL) and the temporal GCL in series is not the optimal solution. To this end, we propose a novel enhanced spatial and extended temporal graph convolutional network (EE-GCN) in this paper. Three convolution kernels with different sizes are chosen to extract discriminative temporal features over shorter to longer terms. The corresponding GCLs are then concatenated by a powerful yet efficient one-shot aggregation (OSA) + effective squeeze-excitation (eSE) structure. The OSA module aggregates the features from each layer once into the output, and the eSE module explores the interdependency between the channels of the output. In addition, we propose a new connection paradigm to enhance the spatial features, which expands the serial connection into a combination of serial and parallel connections by adding a spatial GCL in parallel with the temporal GCLs. The proposed method is evaluated on three large-scale datasets, and the experimental results show that the performance of our method exceeds previous state-of-the-art methods.
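The OSA + eSE aggregation described above can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the kernel sizes (3, 5, 7), the averaging kernels standing in for learned temporal convolutions, and the random FC weight in the eSE gate are all hypothetical placeholders chosen only to show the data flow (multi-kernel branches, one-shot channel-wise concatenation, then a channel re-weighting gate).

```python
import numpy as np

def ese(x):
    # Effective squeeze-excitation (eSE), sketched: squeeze the (C, T)
    # feature map to a channel descriptor by global average pooling,
    # pass it through one fully connected layer (random weight here,
    # purely illustrative), gate with a hard-sigmoid, and rescale channels.
    C, _ = x.shape
    rng = np.random.default_rng(0)
    w = rng.standard_normal((C, C)) * 0.1   # hypothetical FC weight
    s = x.mean(axis=1)                      # squeeze: (C,)
    gate = np.clip((w @ s) / 6 + 0.5, 0, 1)  # hard-sigmoid excitation
    return x * gate[:, None]                # channel-wise re-weighting

def osa_ese(x, kernel_sizes=(3, 5, 7)):
    # One-shot aggregation: run temporal branches with different kernel
    # sizes, concatenate every branch output once along the channel axis,
    # then apply eSE to the aggregated result.
    branches = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k             # stand-in for a learned temporal conv
        branches.append(np.stack([np.convolve(row, kernel, mode="same")
                                  for row in x]))
    out = np.concatenate(branches, axis=0)  # (3C, T) after one-shot concat
    return ese(out)

x = np.random.default_rng(1).standard_normal((4, 16))  # (channels, frames)
y = osa_ese(x)
print(y.shape)  # (12, 16)
```

Each branch sees the full input rather than the previous branch's output, so the concatenation happens exactly once, which is what distinguishes OSA from dense layer-to-layer aggregation.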
Highlights
Human action recognition has many application scenarios in the real world, such as security surveillance, health care systems, autonomous driving and human-computer interaction [1,2,3,4,5]
Our work focuses on the task of skeleton-based action recognition
To address the above two problems, we propose a novel model, namely the enhanced spatial and extended temporal graph convolutional network (EE-GCN)
Summary
Human action recognition has many application scenarios in the real world, such as security surveillance, health care systems, autonomous driving and human-computer interaction [1,2,3,4,5]. Earlier deep-learning approaches were based on Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), which achieved much better performance than hand-crafted methods [20,21,22]. However, whether they model the skeleton data as a sequence of vectors, as RNNs do, or as 2D pseudo-images, as CNNs do, they all neglect that the human skeleton is naturally a non-Euclidean graph structure composed of vertices and edges. Based on this observation, Yan et al. [23] proposed a spatial-temporal graph convolutional network (ST-GCN) representing human joints as vertices and bones as edges. With the combination of serial and parallel relationships between the spatial and temporal GCLs, our model is much more effective at capturing complex spatial-temporal features.