Graph neural networks (GNNs) are widely used to capture the spatial–temporal relationships among human body parts for skeleton-based action recognition. However, redundant sampling of video frames in the temporal domain causes inefficient information propagation, so we focus on refining temporal graphs through key frame selection. To this end, we propose a multi-stage key frame selection (MSKFS) method that finds the most representative frames to serve as graph nodes, learning compact temporal graph representations of human skeletons for action recognition. MSKFS selects key frames progressively in two stages: (1) salient posture frame selection based on the global dynamics of body parts, and (2) key frame refinement and alignment according to intra-frame correlations. The first stage captures the most salient postures and aligns the corresponding content of skeleton sequences within the same category; the second stage supplements subtle details so that the information carried by the salient frames remains complete. Moreover, variational inference is applied to make the key frame refinement and alignment procedure differentiable, allowing end-to-end optimization of arbitrary graph-based models over the resulting compact graph for skeleton-based action recognition. Our MSKFS method achieves state-of-the-art performance on two challenging action recognition datasets.
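To make the first stage concrete, the sketch below illustrates one simple way to rank frames by global body-part dynamics and keep the top-k as key frames. This is an illustrative stand-in, not the paper's actual criterion: the function name, the displacement-based saliency score, and the array layout `(T, J, 3)` are all assumptions introduced here for exposition.

```python
import numpy as np

def select_salient_frames(skeleton_seq, k):
    """Keep the k frames with the largest global body-part dynamics.

    skeleton_seq: array of shape (T, J, 3) -- T frames, J joints, 3D coords.
    Saliency here is the mean joint displacement between consecutive
    frames, a simple proxy for the paper's global-dynamics measure.
    Returns the selected sub-sequence and the kept frame indices.
    """
    # Frame-to-frame joint displacement, averaged over all joints.
    motion = np.linalg.norm(np.diff(skeleton_seq, axis=0), axis=-1).mean(axis=-1)
    # Pad so every frame (including the first) has a saliency score.
    saliency = np.concatenate([[motion[0]], motion])
    # Indices of the k most dynamic frames, restored to temporal order.
    keep = np.sort(np.argsort(saliency)[-k:])
    return skeleton_seq[keep], keep
```

The selected frames would then serve as the nodes of a compact temporal graph; the second stage (refinement and alignment via variational inference) is omitted here, as it depends on model details beyond the abstract.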