Skeleton-based action recognition is an important but challenging task in video understanding and human-computer interaction. However, existing methods suffer from two deficiencies. On the one hand, most methods rely on manually designed convolution kernels, which cannot capture the spatio-temporal joint dependencies of complex regions. On the other hand, some methods apply the self-attention mechanism without a theoretical explanation. In this paper, we propose a unified spatio-temporal graph convolutional network with a self-attention mechanism (SA-GCN) for low-quality motion video data with a fixed viewing angle. SA-GCN extracts features efficiently by learning weights between joints at different scales. Specifically, the proposed self-attention mechanism is trained end-to-end with a mapping strategy for different nodes; it not only characterizes the multi-scale dependencies of joints but also integrates the structural features of the graph with the ability to learn fused features. Moreover, the attention mechanism proposed in this paper can, to some extent, be theoretically explained through GCN, a connection that most existing models do not consider. Extensive experiments on two widely used datasets, NTU RGB+D 60 and NTU RGB+D 120, demonstrate that SA-GCN significantly outperforms a series of existing mainstream approaches in terms of accuracy.
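To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of a self-attention graph convolution over skeleton joints: learned pairwise attention weights between joints are fused with a fixed skeleton adjacency before feature propagation. This is an illustrative assumption of the general technique described in the abstract, not the authors' SA-GCN; the class name, layer structure, and all hyperparameters are invented for illustration.

```python
# Illustrative sketch only -- NOT the paper's SA-GCN implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionGraphConv(nn.Module):
    """Hypothetical layer fusing joint self-attention with graph structure."""

    def __init__(self, in_channels: int, out_channels: int, num_joints: int):
        super().__init__()
        # Learnable projections used to score dependencies between joint pairs.
        self.query = nn.Linear(in_channels, out_channels)
        self.key = nn.Linear(in_channels, out_channels)
        self.value = nn.Linear(in_channels, out_channels)
        # Fixed skeleton adjacency registered as a buffer; a real model would
        # build and normalize this from the dataset's joint connectivity.
        self.register_buffer("adj", torch.eye(num_joints))
        self.scale = out_channels ** 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_joints, in_channels) features for one frame.
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Data-dependent attention between every pair of joints.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        # Fuse learned attention with the fixed skeleton structure, so that
        # propagation respects both the graph and learned joint dependencies.
        weights = attn + self.adj
        return F.relu(weights @ v)

# Usage sketch: 25 joints (as in NTU RGB+D skeletons), 3-D joint coordinates.
layer = SelfAttentionGraphConv(in_channels=3, out_channels=64, num_joints=25)
frames = torch.randn(8, 25, 3)   # batch of 8 single-frame skeletons
out = layer(frames)              # (8, 25, 64)
print(out.shape)
```

Additive fusion of the attention map with the adjacency matrix is just one plausible design choice here; the paper's actual mapping strategy and multi-scale weighting may differ.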