Abstract

The human skeleton conveys intuitive information about actions and is highly robust in dynamic environments, so it has been widely studied in action recognition tasks. Most existing skeleton-based recognition methods rely on graph convolutional networks (GCNs), which extract the topological structure of graphs to describe the dependencies between joints. However, GCNs pay excessive attention to the skeleton structure and neglect the feature information of the skeleton joints. Accordingly, how to fuse the features of both the skeleton structure and the joints remains an open problem. In addition, non-linear temporal convolutional networks (TCNs), which offer greater robustness and learning capability, are rarely investigated in existing methods. With comprehensive consideration of the dependence between structure and features on graphs, we propose a novel structure-feature fusion adaptive GCN (SFAGCN) for skeleton-based action recognition. Our model effectively fuses the topological structure of the skeleton graph with the joint features through decoupled spatiotemporal correlation. This fusion strategy preserves the relevance of spatiotemporal data while ensuring data integrity. Moreover, a gated TCN is used to extract temporal features, further improving network performance. We choose two-stream adaptive GCNs and Shift-GCN as baselines. To demonstrate the effectiveness of our method, extensive experiments are conducted on three large-scale datasets: NTU-RGBD 60, NTU-RGBD 120, and Kinetics-Skeleton 400. Top-1 accuracy on these datasets improves by more than 0.6% on average, and SFAGCN exceeds the state-of-the-art methods.
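The gated temporal convolution mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; it assumes the common tanh/sigmoid gating scheme, where a tanh filter branch is modulated element-wise by a sigmoid gate branch along the time axis.

```python
import numpy as np

def temporal_conv(x, w):
    """1-D convolution along the time axis with 'same' padding.

    x: (T, C_in) sequence of T frames with C_in channels per frame.
    w: (K, C_in, C_out) kernel spanning K frames.
    """
    K, C_in, C_out = w.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    T = x.shape[0]
    out = np.zeros((T, C_out))
    for t in range(T):
        for k in range(K):
            out[t] += xp[t + k] @ w[k]
    return out

def gated_tcn(x, w_filter, w_gate):
    """Gated temporal convolution: tanh branch modulated by a sigmoid gate."""
    f = np.tanh(temporal_conv(x, w_filter))              # filter branch in (-1, 1)
    g = 1.0 / (1.0 + np.exp(-temporal_conv(x, w_gate)))  # gate branch in (0, 1)
    return f * g                                         # element-wise gating

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))          # 10 frames, 4 channels
w_f = rng.standard_normal((3, 4, 4)) * 0.1
w_g = rng.standard_normal((3, 4, 4)) * 0.1
y = gated_tcn(x, w_f, w_g)
print(y.shape)  # (10, 4)
```

Because the gate is bounded in (0, 1), it learns how much of each temporal filter response to pass through, which is the source of the extra non-linearity and robustness the abstract attributes to the gated TCN.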

Highlights

  • Human action recognition is widely applied in video surveillance and human-computer interaction

  • The main contributions of this work are as follows: (1) We propose a novel adaptive structure-and-feature fusion framework, which inherits the advantages of multilayer perceptrons (MLPs) and graph convolutional networks (GCNs)

  • GCNs are applied to feature extraction to obtain the structural properties of graphs in the non-Euclidean domain, which cannot be realized by recurrent neural networks (RNNs) or convolutional neural networks (CNNs)


Summary

INTRODUCTION

Human action recognition is widely applied in video surveillance and human-computer interaction. Yan et al. [8] apply GCNs to extract spatial features of skeleton data and add temporal edges between corresponding joints in consecutive frames via temporal convolutional networks (TCNs), building the spatiotemporal graph convolutional network (ST-GCN). The joint features contain the position information of the human skeleton, which is an essential criterion for action recognition, so it is of great importance for a network to extract this feature. The skeleton structure and the joint features, extracted by a GCN and an MLP respectively, are fused with the above attention weights, as shown in Fig. The performance on three large-scale datasets exceeds the state-of-the-art.
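The fusion described above, a GCN branch for skeleton structure and an MLP branch for joint features combined by attention weights, can be sketched as follows. This is a hedged NumPy sketch under assumed shapes and a fixed attention vector; in the actual model the attention weights and all projection matrices would be learned end to end.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gcn_branch(A, X, W):
    """One graph-convolution layer: aggregate neighbours, then project.

    A: (N, N) adjacency with self-loops; X: (N, C) joint features.
    """
    A_hat = A / A.sum(axis=1)[:, None]       # row-normalised adjacency
    return np.maximum(A_hat @ X @ W, 0.0)    # ReLU

def mlp_branch(X, W1, W2):
    """Per-joint MLP: ignores the graph and models joint features directly."""
    return np.maximum(X @ W1, 0.0) @ W2

def fuse(A, X, W_g, W1, W2, attn_logits):
    """Attention-weighted fusion of the structure (GCN) and feature (MLP) branches."""
    alpha = softmax(attn_logits)             # learned in practice; fixed here
    return alpha[0] * gcn_branch(A, X, W_g) + alpha[1] * mlp_branch(X, W1, W2)

rng = np.random.default_rng(1)
N, C = 5, 8                                  # 5 joints, 8 channels per joint
# Chain-shaped skeleton with self-loops (a stand-in for a real skeleton graph)
A = np.eye(N) + np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
X = rng.standard_normal((N, C))
out = fuse(A, X,
           rng.standard_normal((C, C)) * 0.1,
           rng.standard_normal((C, C)) * 0.1,
           rng.standard_normal((C, C)) * 0.1,
           np.array([0.0, 0.0]))             # equal attention to both branches
print(out.shape)  # (5, 8)
```

The design intuition is that the GCN branch sees only neighbourhood-aggregated information, while the MLP branch preserves each joint's own feature untouched by the graph; the attention weights let the network decide, per layer, how much each view should contribute.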

RELATED WORK
STRUCTURE-FEATURE CONNECTION MECHANISM
PRELIMINARIES
RELATION TO PRIOR WORKS
ADAPTIVE STRUCTURE AND FEATURE FUSION BLOCK
GATED TEMPORAL CONVOLUTION NETWORK
EXPERIMENTS
TRAINING CONFIGURATIONS
Method
Methods
COMPARISON WITH THE STATE-OF-THE-ART
Findings
CONCLUSION
