Abstract

In skeleton-based human action recognition, Transformers, which model the correlations between joint pairs over a global topology, have achieved remarkable results. However, in contrast to the extensive research on learning graph topology in GCNs, Transformer self-attention ignores the topology of the skeleton graph when capturing dependencies between joints. To address this problem, we propose a novel two-stream spatial Graphormer network (2s-SGR), which models joint and bone information using self-attention that incorporates structural encodings. First, in the joint stream, while the Transformer models joint correlations over the global spatial topology, the joint topology and the bone edge information are introduced into self-attention through custom structural encodings. At the same time, joint motion information is modeled in spatial-temporal blocks. The added structure and motion information helps capture dependencies between nodes across frames and enhances the feature representation. Second, for the second-order information of the skeleton, the bone stream adapts to the bone structure by adjusting the custom structural encodings. Finally, the global spatial-temporal features of the joints and bones are fused and fed into the classification network to obtain the action recognition result. Extensive experiments on three large-scale datasets, NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics, demonstrate that the proposed 2s-SGR achieves state-of-the-art performance, and ablation experiments validate its effectiveness.
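To make the idea of structure-aware self-attention concrete, the following is a minimal NumPy sketch, not the 2s-SGR implementation. It assumes a Graphormer-style spatial encoding in which a scalar bias indexed by the hop distance between two joints is added to the attention scores, and it assumes the common convention that a bone vector is the child joint minus its parent joint. The 5-joint chain skeleton, the function names, and the hand-set `hop_bias` values are all hypothetical; in a trained model the bias would be learned, and the abstract does not specify the exact form of the custom structural encodings.

```python
import numpy as np

# Hypothetical 5-joint chain skeleton; each edge is a (child, parent) pair.
EDGES = [(1, 0), (2, 1), (3, 2), (4, 3)]
NUM_JOINTS = 5


def shortest_path_hops(num_joints, edges):
    """All-pairs hop distance on a connected skeleton graph (Floyd-Warshall)."""
    dist = np.full((num_joints, num_joints), np.inf)
    np.fill_diagonal(dist, 0.0)
    for a, b in edges:
        dist[a, b] = dist[b, a] = 1.0
    for k in range(num_joints):
        # Relax every pair (i, j) through intermediate joint k.
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist.astype(int)  # assumes the graph is connected (no inf left)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def biased_self_attention(x, w_q, w_k, w_v, hop_bias, hops):
    """Single-head attention over joints with an additive structural bias.

    x: (V, C) joint features; hops: (V, V) integer hop distances;
    hop_bias: (max_hop + 1,) one scalar per hop distance (assumed form).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores + hop_bias[hops]  # look up a bias for every joint pair
    attn = softmax(scores, axis=-1)
    return attn @ v, attn


def bones_from_joints(joints, edges):
    """Second-order input: bone = child joint - parent joint; root stays zero."""
    bones = np.zeros_like(joints)
    for child, parent in edges:
        bones[child] = joints[child] - joints[parent]
    return bones


rng = np.random.default_rng(0)
hops = shortest_path_hops(NUM_JOINTS, EDGES)
joints = rng.normal(size=(NUM_JOINTS, 3))        # one frame of 3D joints
w_q, w_k, w_v = (rng.normal(size=(3, 4)) for _ in range(3))
hop_bias = -0.5 * np.arange(hops.max() + 1)      # hand-set: favour near joints
out, attn = biased_self_attention(joints, w_q, w_k, w_v, hop_bias, hops)
bones = bones_from_joints(joints, EDGES)
```

Attention still spans all joint pairs, so the global view is kept, while the bias lets the skeleton topology shape the weights. A bone stream along these lines would feed `bones` instead of `joints` and use encodings defined on the bone graph.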
