Abstract

Unsupervised representation learning for skeleton-based human action can be utilized in a variety of pose analysis applications. However, previous unsupervised methods focus on modeling the temporal dependencies in sequences and devote less effort to modeling the spatial structure of human actions. To this end, we propose a novel unsupervised learning framework named Hierarchical Transformer for skeleton-based human action recognition. The Hierarchical Transformer consists of hierarchically aggregated self-attention modules that better capture the spatial and temporal structure of skeleton sequences. Furthermore, we propose predicting the motion between adjacent frames as a novel pre-training task for better capturing long-term dependencies in sequences. Experimental results show that our method outperforms prior state-of-the-art unsupervised methods on the NTU RGB+D and NW-UCLA datasets. In addition, our method achieves state-of-the-art performance when the pre-trained model is transferred to the SBU dataset, which demonstrates the generalizability of the learned representations.
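The abstract gives no implementation details, but the motion-prediction pre-training task it describes can be sketched concretely: the target is the per-joint displacement between adjacent frames, and a small head on top of the sequence encoder regresses it. The sketch below is a minimal illustration of that idea in PyTorch; the encoder interface, tensor shapes, and all names (MotionPredictionHead, motion_prediction_loss, the stand-in GRU encoder) are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class MotionPredictionHead(nn.Module):
    """Predicts per-joint displacement to the next frame from frame features."""
    def __init__(self, feat_dim: int, num_joints: int, coord_dim: int = 3):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_joints * coord_dim)
        self.num_joints, self.coord_dim = num_joints, coord_dim

    def forward(self, feats):                       # feats: (B, T, feat_dim)
        B, T, _ = feats.shape
        return self.proj(feats).view(B, T, self.num_joints, self.coord_dim)

def motion_prediction_loss(encoder, head, seq):
    """seq: (B, T, J, C) skeleton coordinates.
    Trains the encoder to predict the motion (frame-to-frame displacement)
    between adjacent frames, as in the pre-training task described above."""
    target = seq[:, 1:] - seq[:, :-1]               # (B, T-1, J, C) motion targets
    feats = encoder(seq)                            # (B, T, feat_dim), assumed interface
    pred = head(feats)[:, :-1]                      # align predictions with targets
    return nn.functional.mse_loss(pred, target)

# Usage with a stand-in encoder (a single GRU, purely for illustration;
# the paper's hierarchical self-attention encoder would take its place):
B, T, J, C, D = 4, 32, 25, 3, 128
encoder_rnn = nn.GRU(J * C, D, batch_first=True)
encoder = lambda s: encoder_rnn(s.flatten(2))[0]    # (B, T, D) frame features
head = MotionPredictionHead(D, J, C)
loss = motion_prediction_loss(encoder, head, torch.randn(B, T, J, C))
loss.backward()
```

Because the target is a difference of consecutive poses rather than the poses themselves, the pretext task pushes the encoder to represent dynamics, which is consistent with the abstract's claim that it helps capture long-term dependencies.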
