Abstract

In this paper, we propose a simple yet effective self-supervised method called spatio-temporal contrastive learning (ST-CL) for 3D skeleton-based action recognition. ST-CL acquires action-specific features by treating the spatio-temporal continuity of motion tendencies as the supervisory signal. To yield effective representations, ST-CL first designs novel contrastive proxy tasks that provide different spatio-temporal observation scenes of the same 3D action and pull them together in the embedding space. Second, three key components are devised in the action encoder to extract representations efficiently for these contrastive tasks: (1) Information Representation introduces awareness of joint type when analyzing motion dynamics. (2) Non-local GCN learns a data-driven graph topology and promotes spatial message passing among long-range joints in each frame. (3) Multi-Scale TCN enlarges the receptive field to capture richer long-range temporal dynamics among adjacent frames. In ST-CL, the proxy tasks yield useful representations, and the efficient action encoder further enhances representation capacity. Validated on four large-scale datasets, ST-CL serves as a strong, efficient baseline for contrastive learning on skeleton data. Compared with previous self-supervised methods, ST-CL consistently achieves significant improvements with a smaller model size and better training efficiency.
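To make the core contrastive idea concrete, the following is a minimal sketch (not the authors' code) of how two spatio-temporal observation scenes of the same skeleton sequence could be encoded and pulled together in the embedding space with an InfoNCE-style loss. The encoder here is a hypothetical placeholder; ST-CL's actual encoder stacks Information Representation, Non-local GCN, and Multi-Scale TCN blocks, and the exact loss and augmentations are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE over a batch: matching views are positives, other samples are negatives."""
    z1 = F.normalize(z1, dim=1)          # (N, D) embeddings of view 1
    z2 = F.normalize(z2, dim=1)          # (N, D) embeddings of view 2
    logits = z1 @ z2.t() / temperature   # (N, N) cosine-similarity logits
    targets = torch.arange(z1.size(0))   # positive pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

# Skeleton clips of shape (N, C, T, V): batch, channels, frames, joints.
# A stand-in encoder replaces the ST-CL encoder purely for this sketch.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 50 * 25, 128))
x1 = torch.randn(8, 3, 50, 25)   # view 1: e.g. one spatio-temporal observation scene
x2 = torch.randn(8, 3, 50, 25)   # view 2: another observation scene of the same actions
loss = info_nce_loss(encoder(x1), encoder(x2))
loss.backward()
```

The key design choice reflected here is that supervision comes only from agreement between views of the same action, so no labels are required during pre-training.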
