Abstract

Human skeleton sequences contain rich spatial and temporal information for action recognition. However, features in different spatial parts of the skeleton play distinctive roles in different temporal phases of an action, so effectively extracting spatial and temporal features is a challenging task. This paper proposes a novel Multi-Scale Spatial Temporal Graph Convolutional LSTM Network (M-GC-LSTM), which employs multi-neighborhood graph convolution and multiple LSTMs with different time windows to increase the spatial and temporal receptive fields of the network simultaneously. To overcome the over-shoot problem in deep GCN networks, we propose a parallel multi-GCN module (M-GC) that accomplishes the multi-neighborhood convolution, making the GCN wider instead of deeper. An LSTM module with an attention gate is also proposed to improve the representation of long-term information. Multi-scale LSTMs are combined into an M-LSTM module, which improves the extraction of temporal dynamics at different scales. Furthermore, a regularized cross-entropy loss is proposed to optimize the training process. The superiority of the proposed method is demonstrated by comparison with mainstream methods on two large-scale datasets: NTU-RGBD and Kinetics.
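The "wider instead of deeper" idea behind the parallel multi-GCN module can be illustrated with a minimal sketch: each parallel branch aggregates joint features over a different neighborhood size (here, powers of the normalized adjacency matrix) with its own weight matrix, and the branch outputs are summed. All names, the toy graph, and the aggregation details below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def normalize_adj(A):
    # Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def multi_gc(X, A, weights):
    # Hypothetical parallel multi-neighborhood graph convolution:
    # branch k aggregates over the k-hop neighborhood (k-th power of
    # the normalized adjacency) with its own weight matrix; branch
    # outputs are summed, so the network grows wider (more parallel
    # branches), not deeper (more stacked layers).
    A_norm = normalize_adj(A)
    out = np.zeros((X.shape[0], weights[0].shape[1]))
    A_k = np.eye(A.shape[0])
    for W in weights:
        A_k = A_k @ A_norm       # next neighborhood scale
        out += A_k @ X @ W       # per-branch graph convolution
    return out

# Toy skeleton graph: 4 joints connected in a chain
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 8)                        # per-joint features
Ws = [np.random.randn(8, 16) for _ in range(3)]  # 1-, 2-, 3-hop branches
Y = multi_gc(X, A, Ws)
print(Y.shape)  # (4, 16): one fused feature vector per joint
```

Because the branches run in parallel rather than stacking layers, the spatial receptive field grows without the depth-related degradation the abstract attributes to deep GCNs.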
