Abstract

Skeleton-based human action recognition has attracted widespread attention, as skeleton data robustly adapt to dynamic circumstances such as camera view changes and background interference, allowing recognition methods to focus on robust features. In recent studies, the human body is modeled as a topological graph, and a graph convolutional network (GCN) is used to extract action features. Although a GCN has a strong ability to learn spatial patterns, its local message passing ignores the varying degrees of higher-order dependencies between joints. Moreover, the joints represented by vertices are interdependent, so incorporating an attention mechanism to weigh these dependencies is beneficial. In this work, we propose the kernel attention adaptive graph transformer network (KA-AGTN), which models higher-order spatial dependencies between joints with a graph transformer operator based on multi-head self-attention. In addition, the Temporal Kernel Attention (TKA) block in KA-AGTN generates channel-level attention scores from temporal features, which enhances temporal motion correlation. Combined with a two-stream framework and an adaptive graph strategy, KA-AGTN outperforms the baseline 2s-AGCN by 1.9% and 1.0% under X-Sub and X-View on the NTU-RGBD 60 dataset, by 3.2% and 3.1% under X-Sub and X-Set on the NTU-RGBD 120 dataset, and by 2.0% and 2.3% under Top-1 and Top-5 on the Kinetics-Skeleton 400 dataset, achieving state-of-the-art performance.
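To make the two mechanisms named above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a multi-head self-attention module over joints (standing in for the graph transformer operator) and a channel-level attention block driven by pooled temporal features (in the spirit of the TKA block). All module names, tensor shapes, and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class JointSelfAttention(nn.Module):
    """Hypothetical multi-head self-attention over joints within each frame,
    illustrating how higher-order spatial dependencies could be modeled."""

    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, channels, frames, joints) -> attend across the joint axis
        n, c, t, v = x.shape
        tokens = x.permute(0, 2, 3, 1).reshape(n * t, v, c)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(n, t, v, c).permute(0, 3, 1, 2)


class TemporalKernelAttention(nn.Module):
    """Hypothetical channel-level attention computed from temporally pooled
    features, loosely analogous to the TKA block described in the abstract."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        # Squeeze temporal and joint dimensions into a per-channel descriptor.
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Bottleneck MLP maps the descriptor to per-channel attention scores.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, frames, joints)
        n, c, _, _ = x.shape
        scores = self.fc(self.pool(x).view(n, c))   # (batch, channels)
        return x * scores.view(n, c, 1, 1)          # reweight channels


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 25)   # e.g. 25 NTU-RGBD joints, 32 frames
    y = JointSelfAttention(64)(x)
    z = TemporalKernelAttention(64)(y)
    print(z.shape)                   # torch.Size([2, 64, 32, 25])
```

In practice such blocks would be stacked with graph and temporal convolutions inside each stream of a two-stream (joint and bone) framework, but the sketch only shows the attention components themselves.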
