Graph Diffusion Convolutional Network for Skeleton Based Semantic Recognition of Two-Person Actions.

Shuai Li, Hong Qin, Xinxue He, Wenfeng Song, Aimin Hao

doi:10.1109/tpami.2023.3238411

Abstract

Graph Convolutional Networks (GCNs) have successfully boosted skeleton-based human action recognition. However, existing GCN-based methods mostly treat the problem as recognizing each person's action separately and ignore the interaction between the action initiator and the action responder, which matters especially for the fundamental task of two-person interactive action recognition. Effectively accounting for the intrinsic local-global clues of two-person activity remains challenging. Additionally, message passing in a GCN depends on the adjacency matrix, but skeleton-based human action recognition methods tend to compute the adjacency matrix from the fixed natural skeleton connectivity. As a result, messages can only travel along the same fixed paths across different layers of the network and across different actions, which greatly reduces the flexibility of the network. To this end, we propose a novel graph diffusion convolutional network for skeleton-based semantic recognition of two-person actions by embedding graph diffusion into GCNs. On the technical front, we dynamically construct the adjacency matrix from practical action information, so that message propagation is guided in a more meaningful way. Simultaneously, we introduce a frame importance calculation module to perform dynamic convolution, which avoids the negative effect of traditional convolution, whose shared weights may fail to capture key frames or may be affected by noisy frames. Besides, we comprehensively leverage multidimensional features related to the joints' local visual appearance, global spatial relationships, and temporal coherency; for different features, different metrics are designed to measure similarity in accordance with the corresponding physical laws of the motions. Moreover, extensive experiments and comprehensive evaluations on four public large-scale datasets (NTU-RGB+D 60, NTU-RGB+D 120, Kinetics-Skeleton 400, and SBU-Interaction) demonstrate that our method outperforms state-of-the-art methods.
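
To make the idea of graph diffusion concrete, the sketch below shows one generic form of it, personalized-PageRank diffusion, applied to a toy skeleton. It is not the construction used in this paper: the abstract does not give the diffusion kernel, so the kernel choice, the symmetric normalization, the function name `ppr_diffusion`, the teleport value `alpha=0.15`, and the 5-joint chain are all assumptions made for illustration. The point is only that a diffusion matrix S = alpha * (I - (1 - alpha) * T)^-1 lets messages pass between joints that are not directly connected by a bone.

```python
import numpy as np


def ppr_diffusion(adj, alpha=0.15):
    """Personalized-PageRank diffusion: S = alpha * (I - (1 - alpha) * T)^-1.

    `adj` is a symmetric 0/1 adjacency matrix; T is its symmetrically
    normalized version with self-loops. Illustrative choice, not the
    paper's kernel.
    """
    n = adj.shape[0]
    a = adj + np.eye(n)                      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    t = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * t)


# Hypothetical toy skeleton: 5 joints connected in a chain 0-1-2-3-4.
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

S = ppr_diffusion(adj)

# One propagation step over random 3-D joint features.
X = np.random.default_rng(0).normal(size=(5, 3))
H = S @ X
```

In the fixed adjacency, joints 0 and 4 are four hops apart and never exchange messages within a single layer; in `S`, the entry `S[0, 4]` is small but strictly positive, so a single propagation step already mixes their features.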

Full Text