Contrastive 3D Human Skeleton Action Representation Learning via CrossMoCo With Spatiotemporal Occlusion Mask Data Augmentation

Qinyang Zeng, Ming Liu, Chengju Liu, Qijun Chen

doi:10.1109/tmm.2023.3253048

Abstract

Self-supervised contrastive learning methods for 3D skeleton-based action recognition have achieved results competitive with classical supervised methods. Recent research shows that adding a multilayer perceptron (MLP) on top of the base encoder helps extract high-level, global positive representations, and that using a negative memory bank to store negative samples dynamically balances ample storage against feature consistency. However, these methods need to account for two issues: the MLP lacks accurate encoding of fine-grained local features, and a memory bank needs rich and diverse negative sample pairs to match positive representations from different encoders. This paper proposes a new method called Cross Momentum Contrast (CrossMoCo), composed of three parts: an ST-GCN encoder, an ST-GCN encoder with an MLP encoder (ST-MLP encoder), and two independent negative memory banks. The two encoders encode the input data into two positive feature pairs. Learning the cross representations of the two positive pairs helps the model extract both global and local information. The two independent negative memory banks update their negative samples according to the different positive representations from the two encoders, which diversifies the distribution of negative samples and brings negative representations closer to the positive features. The resulting increase in classification difficulty improves the model's contrastive learning ability. In addition, a spatiotemporal occlusion mask data augmentation method is used to increase the information diversity of positive samples. This method takes adjacent skeleton joints that form a skeleton bone as the mask unit, which reduces information redundancy after data augmentation because adjacent joints may carry similar spatiotemporal information. Experiments on the PKU-MMD Part II dataset, the NTU RGB+D 60 dataset, and the NW-UCLA dataset show that the CrossMoCo framework with spatiotemporal occlusion mask data augmentation achieves comparable performance.
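
The bone-level mask unit described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The toy five-joint skeleton `TOY_BONES`, the function name `bone_occlusion_mask`, the use of zeros as the occlusion value, and the choice of a single contiguous frame window are all assumptions made for illustration; the abstract only states that adjacent joints forming a bone are masked together.

```python
import random

# Hypothetical 5-joint toy skeleton: each bone is a pair of adjacent joint indices.
# A real dataset would supply its own joint count and bone list.
TOY_BONES = [(0, 1), (1, 2), (1, 3), (3, 4)]


def bone_occlusion_mask(seq, bones, num_bones, num_frames, rng):
    """Occlude randomly chosen bones over a random contiguous window of frames.

    seq: list of frames; each frame is a list of joints; each joint is a
         list of coordinates.
    Returns (masked copy of seq, window start index, set of masked joints).
    The input sequence is left untouched.
    """
    total_frames = len(seq)
    num_frames = min(num_frames, total_frames)
    # Temporal part: pick where the occluded window begins.
    start = rng.randrange(total_frames - num_frames + 1)

    # Spatial part: the mask unit is a bone, so both of its joints are masked.
    masked_joints = set()
    for joint_a, joint_b in rng.sample(bones, num_bones):
        masked_joints.update((joint_a, joint_b))

    # Deep-copy the sequence, then zero the selected joints (assumed fill value).
    out = [[list(joint) for joint in frame] for frame in seq]
    for t in range(start, start + num_frames):
        for j in masked_joints:
            out[t][j] = [0.0] * len(out[t][j])
    return out, start, masked_joints


if __name__ == "__main__":
    rng = random.Random(0)
    # 10 frames, 5 joints, 3 coordinates per joint, all set to 1.0.
    sequence = [[[1.0, 1.0, 1.0] for _ in range(5)] for _ in range(10)]
    masked, start, joints = bone_occlusion_mask(sequence, TOY_BONES, 2, 4, rng)
    print("window start:", start, "masked joints:", sorted(joints))
```

Masking whole bones rather than isolated joints means that a joint's nearest neighbor along the occluded bone is hidden with it, which is one way to read the abstract's point about adjacent joints carrying similar information.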

Full Text