Abstract
Modeling the global spatiotemporal dependencies and dynamics of body joints is crucial to recognizing actions from 3D skeleton sequences. We propose a Relation-mining Self-Attention Network (RSA-Net) for skeleton-based human action recognition. RSA-Net is motivated by two observations: (1) body joint relationships can be modeled independently as pairwise and unary terms, which reduces the difficulty of action feature learning; (2) computing action semantics and position information independently removes noisy correlations over heterogeneous embeddings. RSA-Net contains pairwise self-attention, unary self-attention, and position embedding attention modules. The pairwise self-attention captures the relationship between every two body joints. The unary self-attention learns general correlation features of one key joint over all other query joints. The position embedding attention module computes the correlation between action semantics and position information independently, with separate projection matrices. Extensive evaluations are performed on the NTU-60, NTU-120, and UESTC datasets with the CS, CV, CSet, and A-view evaluation benchmarks. RSA-Net outperforms existing transformer-based approaches and achieves results comparable to state-of-the-art graph ConvNet methods. The source code is available on GitHub.
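As a rough illustration of the three attention terms named above, the following pure-Python sketch may help. It is not the RSA-Net implementation: the abstract does not give the exact formulas, so the sketch assumes a common decomposition in which the pairwise term uses mean-centred queries and keys, the unary term scores each key joint against the mean query, and position scores are added from their own projection matrices. All function names and parameters here are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(rows):
    """Column-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def pairwise_attention(queries, keys):
    """Assumed pairwise term: row i is the attention of query joint i over
    all key joints, computed from mean-centred queries and keys."""
    mu_q, mu_k = mean_vector(queries), mean_vector(keys)
    qc = [[x - m for x, m in zip(q, mu_q)] for q in queries]
    kc = [[x - m for x, m in zip(k, mu_k)] for k in keys]
    scale = math.sqrt(len(mu_q))
    return [softmax([dot(q, k) / scale for k in kc]) for q in qc]


def unary_attention(queries, keys):
    """Assumed unary term: one weight per key joint, shared by every query
    joint, scored against the mean query."""
    mu_q = mean_vector(queries)
    scale = math.sqrt(len(mu_q))
    return softmax([dot(mu_q, k) / scale for k in keys])


def project(rows, weight):
    """Multiply each row vector by a (d_in x d_out) matrix of nested lists."""
    return [[dot(row, col) for col in zip(*weight)] for row in rows]


def untied_scores(features, positions, wq, wk, uq, uk):
    """Assumed position term: semantic and position correlations are computed
    with separate projection matrices and then summed (unnormalised)."""
    q, k = project(features, wq), project(features, wk)
    pq, pk = project(positions, uq), project(positions, uk)
    return [
        [dot(qi, kj) + dot(pqi, pkj) for kj, pkj in zip(k, pk)]
        for qi, pqi in zip(q, pq)
    ]
```

In this sketch, `pairwise_attention` returns one attention distribution per query joint, while `unary_attention` returns a single distribution over key joints that every query shares, which is one way to read "one key joint over all other query joints".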