Abstract

Video-based action recognition is challenging due to the rapid and uncertain changes in human actions. Recent studies show that incorporating human body skeleton data alongside video improves action recognition performance. These methods generally use graph convolutional networks (GCNs) to extract structural features of the human body joints from skeleton data. However, most GCN-based methods for skeleton-based action recognition share several limitations. (1) The graph structure of the human body joints is time-invariant, making it difficult to represent the changing relationships between joints across actions. (2) Single-stream methods utilize only limited information from skeleton data, such as joints or bones, and fail to capture coherent features of movement. (3) Multi-stream methods contain a large number of parameters and are therefore inefficient for real-life applications. To address these problems, we propose an adaptive spatiotemporal graph convolutional network with intermediate aggregation of multi-stream skeleton features for action recognition. First, our method learns an adaptive graph structure that represents the changing relationships between joints. Second, we employ a multi-stream model to extract complementary features from the skeleton, including a joint stream, a bone stream, and a motion stream. Moreover, an intermediate aggregation strategy fuses these features and reduces the model's parameter count. The proposed method has been validated on several benchmarks and a real-world abnormal action dataset. Extensive experimental results show that our method achieves excellent performance in skeleton-based action recognition.
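
To make the three ideas in the abstract concrete, the sketch below is a minimal PyTorch rendering of them, not the authors' implementation: all class and parameter names (AdaptiveGraphConv, STBlock, MultiStreamAGCN, num_joints, channels) are illustrative assumptions. It shows a learnable adjacency matrix in place of a fixed skeleton graph, three shallow stream-specific stems for the joint, bone, and motion inputs, and an intermediate summation of their features so that the deeper layers are shared rather than triplicated.

```python
# Minimal sketch (assumed PyTorch), illustrating the abstract's three ideas.
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Spatial graph convolution with a learnable (adaptive) adjacency matrix."""
    def __init__(self, in_channels, out_channels, num_joints):
        super().__init__()
        # Learnable adjacency: updated during training, so joint relationships
        # are no longer fixed by the physical skeleton topology.
        self.A = nn.Parameter(
            torch.eye(num_joints) + 1e-3 * torch.randn(num_joints, num_joints))
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):                              # x: (N, C, T, V)
        x = torch.einsum('nctv,vw->nctw', x, self.A)   # aggregate over joints
        return self.conv(x)

class STBlock(nn.Module):
    """One spatiotemporal block: adaptive spatial GCN + temporal convolution."""
    def __init__(self, in_channels, out_channels, num_joints):
        super().__init__()
        self.gcn = AdaptiveGraphConv(in_channels, out_channels, num_joints)
        self.tcn = nn.Conv2d(out_channels, out_channels,
                             kernel_size=(9, 1), padding=(4, 0))  # temporal conv
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.tcn(self.gcn(x)))

class MultiStreamAGCN(nn.Module):
    """Joint, bone, and motion streams with intermediate feature aggregation."""
    def __init__(self, num_joints=25, num_classes=60, channels=64):
        super().__init__()
        # One shallow stem per input stream.
        self.stems = nn.ModuleList(
            [STBlock(3, channels, num_joints) for _ in range(3)])
        # Shared deeper layers process the fused representation.
        self.shared = nn.Sequential(
            STBlock(channels, channels, num_joints),
            STBlock(channels, 2 * channels, num_joints))
        self.fc = nn.Linear(2 * channels, num_classes)

    def forward(self, joint, bone, motion):            # each: (N, 3, T, V)
        feats = [stem(s) for stem, s in zip(self.stems, (joint, bone, motion))]
        x = torch.stack(feats).sum(0)                  # intermediate aggregation
        x = self.shared(x)
        x = x.mean(dim=(2, 3))                         # global pool over T, V
        return self.fc(x)

# Bone and motion streams are simple transforms of the joint stream:
# bone[v] = joint[v] - joint[parent(v)];  motion[t] = joint[t+1] - joint[t].
```

Aggregating the streams after a shallow stem, rather than fusing softmax scores from three complete networks, is what keeps the parameter count low relative to conventional multi-stream models.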
