Abstract

Skeleton-based action recognition has been studied continuously and intensively. However, dynamic 3D skeleton data are difficult to deploy in practical applications because of restrictive data acquisition conditions. Although action recognition based on 2D pose information extracted from RGB video can effectively avoid the influence of complex backgrounds, it remains susceptible to factors such as video jitter and joint overlap. To reduce the interference of these factors, we represent changes in human body posture with two-dimensional skeletal joint coordinates. First, we apply a target detector and a pose estimation algorithm to obtain the joint coordinates of each frame from RGB video. We then use a feature extraction network to perform multi-level feature learning, establishing correspondences between actions and their multi-level features. Finally, we introduce a hierarchical attention mechanism to design a model named CHAN; by computing associations between elements, it redistributes the weights used for action classification. Extensive experiments on three datasets demonstrate the effectiveness of the proposed method.
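The attention step described above can be sketched as a plain scaled dot-product self-attention over per-joint features. This is a minimal illustration, not the paper's CHAN architecture: the function, the feature dimension, and the 17-joint layout are assumptions, since the abstract does not specify the model's internals.

```python
import numpy as np

def self_attention(x):
    """Toy single-head self-attention over per-joint features.

    x: (num_joints, dim) feature matrix for one frame. Hypothetical
    illustration of redistributing weights by computing pairwise
    associations between elements; CHAN's actual hierarchy differs.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                  # pairwise association
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ x, weights

# Assumed setup: 17 joints (COCO-style layout) with an 8-dim feature each.
rng = np.random.default_rng(0)
joint_features = rng.standard_normal((17, 8))
out, attn = self_attention(joint_features)
```

The attention matrix `attn` gives each joint a normalized weight over all other joints, which is the basic mechanism by which attention reweights contributions to the final classification.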