Improving self-supervised action recognition from extremely augmented skeleton sequences

Tianyu Guo,Mengyuan Liu,Hong Liu,Guoquan Wang,Wenhao Li

doi:10.1016/j.patcog.2024.110333

Abstract

Self-supervised contrastive learning has been widely applied to skeleton-based action recognition due to its ability to learn discriminative features. However, directly applying the existing contrastive learning framework for 3D skeleton learning is limited by the well-designed augmentations and the simple multi-stream decision-level fusion. To deal with these drawbacks, we propose a three-stream contrastive learning framework utilizing abundant information mining for self-supervised action representation (3s-AimCLR++). For single-stream contrastive learning, extreme augmentation is first proposed to generate more movement patterns, which can introduce more movement patterns to improve the universality of the learned representations. Since directly using extreme augmentation can barely boost the performance due to the drastic changes in original identity, the Distributional Divergence Minimization (DDM) loss is proposed to utilize the extreme augmentation more gently. Moreover, the Single-Stream Nearest Neighbors Mining (SNNM) is proposed to expand positive samples to make the learning process more reasonable. For multi-stream, existing methods simply ensemble the results. Yet, considering the complementarity of information between different streams, we propose Multi-Stream Aggregation and Interaction (MSAI) strategy to better fuse multi-stream information. Extensive experiments on NTU-60, NTU-120, and PKU-MMD datasets have verified that our 3s-AimCLR++ can significantly perform favorably against state-of-the-art methods under a variety of evaluation protocols. The code and models are available at https://github.com/Levigty/AimCLR-v2.

Full Text