Abstract

Skeleton-based action recognition, which analyses the spatial configuration and temporal dynamics of a human action from skeleton data, has attracted increasing attention and has been widely applied in intelligent video surveillance and human-computer interaction. Designing an effective framework that learns discriminative spatial and temporal characteristics for skeleton-based action recognition remains a challenging problem. The shape and motion representations of skeleton sequences are direct embodiments of spatial and temporal characteristics respectively, which makes them well suited to describing human actions. In this work, we propose an original unified framework that learns comprehensive shape and motion representations from skeleton sequences using Geometric Algebra. We first construct the skeleton sequence space as a subset of Geometric Algebra to represent each skeleton sequence along both the spatial and temporal dimensions. A rotor-based view transformation method is then proposed to eliminate the effect of viewpoint variation while preserving the relative spatio-temporal relations among the skeleton frames in a sequence. We also construct a spatio-temporal view-invariant model (STVIM) to collectively integrate the spatial configuration and temporal dynamics of skeleton joints and bones. In STVIM, mutually compensating shape and motion representations of a skeleton sequence are learned jointly to describe skeleton-based actions comprehensively. Furthermore, a selected multi-stream Convolutional Neural Network is employed to extract and fuse deep features from mapping images of the learned representations for skeleton-based action recognition. Experimental results on the NTU RGB+D, Northwestern-UCLA and UTD-MHAD datasets consistently verify the effectiveness of our proposed method and its superior performance over state-of-the-art competitors.
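The abstract does not give the exact rotor construction used in the paper, but the core idea of a rotor-based view transformation can be sketched in Geometric Algebra terms: a 3D rotor (equivalent to a unit quaternion) is applied to every joint of every frame via the sandwich product v' = R v R̃, so the whole sequence is rotated rigidly and relative spatio-temporal relations are preserved. The helper names and the axis-angle parameterisation below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def rotor_from_axis_angle(axis, theta):
    """Build a 3D rotor as a unit quaternion (w, x, y, z):
    R = cos(theta/2) + sin(theta/2) * axis  (axis is the rotation axis).
    This is an assumed parameterisation; the paper may construct rotors differently."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    half = theta / 2.0
    return np.concatenate([[np.cos(half)], np.sin(half) * axis])

def quat_mul(q, p):
    """Quaternion (geometric) product of q and p."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = p
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def apply_rotor(R, joints):
    """Rotate an (N, 3) array of joint coordinates with the sandwich
    product v' = R v R~, where R~ is the reversed (conjugate) rotor.
    Applying the same R to all frames keeps relative relations intact."""
    R_rev = R * np.array([1.0, -1.0, -1.0, -1.0])  # rotor reversion
    out = np.empty_like(joints, dtype=float)
    for i, v in enumerate(joints):
        qv = np.concatenate([[0.0], v])            # embed vector as pure quaternion
        out[i] = quat_mul(quat_mul(R, qv), R_rev)[1:]
    return out
```

For example, a rotor for a 90-degree rotation about the vertical axis maps the joint (1, 0, 0) to (0, 1, 0), and because a single rotor is applied to all frames, bone lengths and inter-frame motion vectors are unchanged up to the same rigid rotation.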
