Abstract

<abstract><p>Skeleton-based action recognition is an important but challenging task in video understanding and human-computer interaction. However, existing methods suffer from two deficiencies. On the one hand, most methods rely on manually designed convolution kernels, which cannot capture the spatio-temporal joint dependencies of complex regions. On the other hand, some methods simply apply a self-attention mechanism without offering a theoretical explanation for it. In this paper, we propose a unified spatio-temporal graph convolutional network with a self-attention mechanism (SA-GCN) for low-quality motion video data with a fixed viewing angle. SA-GCN extracts features efficiently by learning weights between joint points at different scales. Specifically, the proposed self-attention mechanism is trained end-to-end with a mapping strategy for different nodes; it not only characterizes the multi-scale dependencies of joints but also integrates the structural features of the graph with the ability to learn fused features. Moreover, the attention mechanism proposed in this paper can be theoretically explained by GCN to some extent, which most existing models do not consider. Extensive experiments on two widely used datasets, NTU-60 RGB+D and NTU-120 RGB+D, demonstrate that SA-GCN significantly outperforms a series of existing mainstream approaches in terms of accuracy.</p></abstract>
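
The idea of learning weights between joints while retaining the structure of the skeleton graph can be illustrated with a minimal sketch. This is not the SA-GCN architecture itself, which the abstract does not specify. It assumes a single frame, a scaled dot-product attention map over joints, and a simple additive fusion with a symmetrically normalized adjacency matrix. The function name `attention_graph_conv`, the projection matrices `Wq`, `Wk`, `Wv`, and the toy five-joint chain are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_graph_conv(X, A, Wq, Wk, Wv):
    """One illustrative layer (an assumption, not the paper's exact layer).

    X  : (V, C) features for V joints in a single frame
    A  : (V, V) skeleton adjacency matrix, self-loops included
    Wq, Wk, Wv : (C, D) projection matrices (hypothetical names)
    """
    # Fixed structural term: symmetric normalisation D^{-1/2} A D^{-1/2}.
    deg = A.sum(axis=1)
    A_norm = A / np.sqrt(np.outer(deg, deg))

    # Learned term: scaled dot-product attention between every pair of joints.
    Q, K = X @ Wq, X @ Wk
    S = softmax(Q @ K.T / np.sqrt(Q.shape[1]), axis=-1)

    # Assumed fusion: add the learned joint weights to the skeleton graph,
    # then aggregate the projected features as in a graph convolution.
    return (A_norm + S) @ (X @ Wv)


# Toy skeleton: a chain of five joints with self-loops.
rng = np.random.default_rng(0)
V, C, D = 5, 3, 4
A = np.eye(V)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

X = rng.standard_normal((V, C))
Wq, Wk, Wv = (rng.standard_normal((C, D)) for _ in range(3))

out = attention_graph_conv(X, A, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

In this sketch the attention map `S` plays the role of a data-dependent adjacency matrix, which is one way to read the claim that the attention mechanism can be interpreted through GCN: both terms aggregate neighbor features, and they differ only in whether the edge weights are fixed or learned.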
