Abstract

Skeleton-based human action recognition is attracting increasing attention and is widely applied in virtual reality, human–computer interaction systems, and other domains. Nevertheless, the performance of recent methods on actions with similar appearance features remains unsatisfactory, owing to their inherent weakness in modeling discriminative temporal dynamics. In addition, previous methods are limited in their ability to mine the global information of an action. To this end, we propose a multiple temporal scale aggregation graph convolutional network. First, taking advantage of the varying temporal resolutions offered by different layers of the graph convolutional network, we develop a multiple temporal scale aggregation module to extract discriminative temporal features. Second, we propose a new skeleton feature representation, termed relative joint across frames, which provides stronger global structure cues than absolute coordinates. Furthermore, we propose a five-stream structure that comprehensively models complementary features and ultimately yields a significant performance boost. In our experiments, the method improves on the baseline by 2.38% and 4.08% on the cross-subject evaluation benchmarks of NTU-RGB+D 60 and NTU-RGB+D 120, respectively.
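The sketch below is a rough, hedged illustration of the two core ideas named above, not the authors' exact formulation: a cross-frame relative-joint feature (subtracting a reference-frame pose so the feature encodes global motion rather than absolute positions) and a simple multi-scale temporal aggregation (a pooling pyramid that fuses coarse and fine temporal resolutions). Both function names, the choice of reference frame, and the pooling scales are illustrative assumptions.

```python
import numpy as np

def relative_joints_across_frames(skel, ref_frame=0):
    # skel: (T, V, C) -- T frames, V joints, C coordinates (e.g. x, y, z).
    # Express each joint relative to its own position in a reference frame,
    # so the feature carries cross-frame (global motion) structure instead
    # of per-frame absolute coordinates. Reference-frame choice is an
    # illustrative assumption.
    skel = np.asarray(skel, dtype=np.float32)
    return skel - skel[ref_frame:ref_frame + 1]

def multi_temporal_scale_aggregate(feat, scales=(1, 2, 4)):
    # feat: (T, D) per-frame features. For each scale s, average-pool over
    # non-overlapping windows of s frames, upsample back to full length,
    # and average across scales -- a toy stand-in for aggregating the
    # different temporal resolutions produced by different network layers.
    feat = np.asarray(feat, dtype=np.float32)
    T, D = feat.shape
    out = np.zeros_like(feat)
    for s in scales:
        Ts = (T // s) * s                                  # truncate to a multiple of s
        coarse = feat[:Ts].reshape(-1, s, D).mean(axis=1)  # (T // s, D)
        out[:Ts] += np.repeat(coarse, s, axis=0)           # upsample to Ts frames
    return out / len(scales)

# Toy usage: 16 frames, 25 joints (NTU-style skeleton), 3-D coordinates.
seq = np.random.rand(16, 25, 3).astype(np.float32)
rel = relative_joints_across_frames(seq)            # (16, 25, 3); frame 0 is all zeros
fused = multi_temporal_scale_aggregate(rel.reshape(16, -1))
print(rel.shape, fused.shape)                       # (16, 25, 3) (16, 75)
```

In the paper itself these roles are played by learned graph-convolutional layers and a five-stream fusion; the sketch only shows the shape of the computation each idea implies.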
