Abstract

Skeleton-based action recognition is widely used in varied areas, e.g., surveillance and human-machine interaction. Existing models are mainly learned in a supervised manner, thus heavily depending on large-scale labeled data, which could be infeasible when labels are prohibitively expensive. In this paper, we propose a novel Contrast-Reconstruction Representation Learning network (CRRL) that simultaneously captures postures and motion dynamics for unsupervised skeleton-based action recognition. It consists of three parts: Sequence Reconstructor (SER), Contrastive Motion Learner (CML), and Information Fuser (INF). SER learns representation from skeleton coordinate sequence via reconstruction. However the learned representation tends to focus on trivial postural coordinates and be hesitant in motion learning. To enhance the learning of motions, CML performs contrastive learning between the representation learned from coordinate sequences and additional velocity sequences, respectively. Finally, in the INF module, we explore varied strategies to combine SER and CML, and propose to couple postures and motions via a knowledge-distillation based fusion strategy which transfers the motion learning from CML to SER. Experimental results on several benchmarks, i.e., NTU RGB+D 60/120, PKU-MMD, CMU, and NW-UCLA, demonstrate the promise of the our method by outperforming state-of-the-art approaches.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call