Skeleton-based action recognition aims to recognize human actions from the coordinates of human joints. By encoding coordinates as joint tokens, previous methods have successfully used the self-attention (SA) mechanism to capture the relationship between each pair of joints. However, the attention map generated from the joint Query and joint Key in SA captures joint-to-joint correlations only at a single granularity, which is insufficient for human actions whose semantics are expressed in terms of body parts. In this paper, we argue that SA needs a more comprehensive mechanism that captures correlations in both joint-to-joint and joint-to-partition patterns to obtain a higher-level semantic representation of skeleton-based actions. We therefore propose Joint-Partition Group Attention (JPGA), which simultaneously captures correlations between joints and body parts at different granularities. Specifically, JPGA aggregates the joint tokens according to each joint's body-partition attributes and produces body-part tokens (partition tokens) at different granularities. The attention map of JPGA is then computed from joint tokens and partition tokens of different granularities to represent the relationships between joints and body parts. To partition the human body adaptively at each granularity, we apply the reparameterization trick to learn the multi-granularity partitioning matrices. Based on JPGA, we construct our Joint-Partition Former (JPFormer) and conduct extensive experiments on the NTU-RGB+D, NTU-RGB+D 120, and Northwestern-UCLA datasets, achieving state-of-the-art results that highlight the effectiveness of our design.
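To make the mechanism concrete, the following is a minimal sketch (not the authors' implementation) assuming a PyTorch-style module: a learnable joint-to-part assignment matrix, sampled with Gumbel-softmax reparameterization so the grouping stays differentiable, aggregates joint tokens into partition tokens at one granularity, and joint queries then attend to the concatenation of joint and partition keys/values. All names here (`JointPartitionGroupAttention`, `num_parts`, `tau`) are hypothetical; the paper uses multiple granularities, whereas this sketch shows a single one.

```python
# Hypothetical sketch of joint-partition group attention (single granularity).
# Shapes, names, and hyperparameters are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointPartitionGroupAttention(nn.Module):
    """Attention of joint tokens over joint tokens plus coarser body-part tokens."""

    def __init__(self, dim, num_joints=25, num_parts=5, tau=1.0):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.tau = tau
        # Learnable logits assigning each joint to a body-part partition.
        self.partition_logits = nn.Parameter(torch.randn(num_joints, num_parts))

    def forward(self, x):
        # x: (batch, num_joints, dim) joint tokens.
        # Soft joint-to-part assignment via Gumbel-softmax reparameterization.
        assign = F.gumbel_softmax(self.partition_logits, tau=self.tau, dim=-1)  # (J, P)
        # Aggregate joint tokens into partition (body-part) tokens.
        part_tokens = assign.transpose(0, 1) @ x                                # (B, P, D)

        q = self.q(x)                                                           # (B, J, D)
        kv = torch.cat([x, part_tokens], dim=1)                                 # (B, J+P, D)
        k, v = self.k(kv), self.v(kv)

        # Attention map covers joint-to-joint and joint-to-partition correlations.
        attn = (q @ k.transpose(-2, -1)) * self.scale                           # (B, J, J+P)
        attn = attn.softmax(dim=-1)
        return attn @ v                                                         # (B, J, D)


if __name__ == "__main__":
    x = torch.randn(2, 25, 64)  # 2 skeletons, 25 joints, 64-dim joint tokens
    out = JointPartitionGroupAttention(dim=64)(x)
    print(out.shape)            # torch.Size([2, 25, 64])
```

In this reading, extending the sketch to multiple granularities would amount to maintaining several assignment matrices with different numbers of parts and concatenating all resulting partition tokens into the key/value set.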