Abstract

Skeleton-based action recognition is widely used in areas such as human–machine interaction and virtual reality. Benefiting from their powerful ability to represent structured data, graph convolutional networks (GCNs) have been developed to address this task by modeling human body skeletons as spatial–temporal graphs. However, most existing GCN-based methods ignore the diversity of motion information across the channels of the input feature. Enhancing the ability to capture long-term global correlations in the spatial and temporal dimensions also remains a fundamental challenge. In this work, we propose a novel multi-stream framework, the Global–Local Motion Fusion Network (GLMFN), which integrates global and local motion information across the spatial and temporal dimensions. Specifically, we design a grouping graph convolution module to strengthen the aggregation of local spatial motion information. In addition, to learn richer semantic features, we propose two modules based on the self-attention operator: a spatial self-attention module and a temporal self-attention module. The former extracts long-term spatial motion relationships, while the latter captures long-term temporal motion relationships. Moreover, we present a multi-stream fusion strategy that applies a series of treatments to the body joints to further improve recognition performance. To validate the efficacy and efficiency of the proposed model, we conduct extensive experiments on the NTU RGB+D and NTU RGB+D 120 datasets, where our method achieves state-of-the-art performance on both.
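The abstract does not specify the internals of the two attention modules, but a minimal sketch of a spatial self-attention block over skeleton features helps make the idea concrete. The sketch below is one plausible reading, not the paper's implementation: the class name SpatialSelfAttention, the embed_channels parameter, and the (batch, channels, frames, joints) tensor layout are assumptions based on common practice in skeleton-based GCN pipelines. The temporal module would be analogous, with attention computed over the frame axis instead of the joint axis.

    import torch
    import torch.nn as nn

    class SpatialSelfAttention(nn.Module):
        """Hypothetical spatial self-attention over skeleton features of
        shape (N, C, T, V): batch, channels, frames, joints. Attention is
        computed across the V joints within each frame, so every joint can
        attend to every other joint regardless of the skeleton's physical
        connectivity, capturing long-term spatial relationships."""

        def __init__(self, in_channels, embed_channels):
            super().__init__()
            # 1x1 convolutions produce per-joint query/key/value embeddings.
            self.query = nn.Conv2d(in_channels, embed_channels, kernel_size=1)
            self.key = nn.Conv2d(in_channels, embed_channels, kernel_size=1)
            self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)
            self.scale = embed_channels ** -0.5

        def forward(self, x):                       # x: (N, C, T, V)
            q = self.query(x)                       # (N, Ce, T, V)
            k = self.key(x)                         # (N, Ce, T, V)
            v = self.value(x)                       # (N, C,  T, V)
            # Joint-to-joint affinities per frame: (N, T, V, V).
            attn = torch.einsum('nctv,nctw->ntvw', q, k) * self.scale
            attn = attn.softmax(dim=-1)
            # Aggregate value features over joints; keep a residual path.
            out = torch.einsum('ntvw,nctw->nctv', attn, v)
            return x + out

Because the attention matrix is computed from the data rather than from a fixed adjacency matrix, this kind of block can relate physically distant joints (e.g. a hand and a foot) in a single step, which is the long-range behavior the abstract attributes to the spatial self-attention module.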
