Abstract

In recent years, graph convolutional networks (GCNs) have achieved remarkable performance in skeleton-based action recognition. In existing works, the features of all nodes in the neighbor set are aggregated into the updated feature of the root node, and these features lie in the same feature channel, determined by the same $1 \times 1$ convolution filter. This may not be optimal for effectively capturing spatial features among adjacent vertices. Moreover, the effect of feature channels that are independent of the current action on model performance is rarely investigated in existing methods. In this paper, we propose cross-channel graph convolutional networks for skeleton-based action recognition. The feature fusion mechanism in our network is cross-channel, i.e., the updated feature of the root node is derived from different feature channels. Because different feature channels come from different $1 \times 1$ convolution filters, the cross-channel fusion mechanism significantly improves the model's ability to capture local features among adjacent vertices. Furthermore, by introducing a channel attention mechanism, we suppress the influence of feature channels unrelated to action recognition, which improves the robustness of the model against feature channels independent of the current action. Extensive experiments on two large-scale datasets, NTU-RGB+D and Kinetics-Skeleton, demonstrate that our model outperforms current mainstream methods.
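
To make the fusion idea concrete, below is a minimal PyTorch sketch of a cross-channel graph convolution layer with channel attention. This is an illustration under our own assumptions, not the authors' implementation: the module name `CrossChannelGraphConv`, the use of `torch.roll` to draw neighbor features from different channels, the `num_subsets` adjacency partitioning, and the SE-style squeeze-and-excitation attention are all hypothetical choices.

```python
import torch
import torch.nn as nn

class CrossChannelGraphConv(nn.Module):
    """Sketch of cross-channel graph convolution with channel attention."""

    def __init__(self, in_channels, out_channels, num_subsets=3):
        super().__init__()
        # One independent 1x1 filter per adjacency subset, so the features
        # fused into the root node come from different filters.
        self.convs = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, kernel_size=1)
            for _ in range(num_subsets)
        )
        # SE-style channel attention (an assumed realization of the CAM):
        # rescales channels so those unrelated to the current action are
        # suppressed.
        self.fc1 = nn.Linear(out_channels, out_channels // 4)
        self.fc2 = nn.Linear(out_channels // 4, out_channels)

    def forward(self, x, adjacency):
        # x: (N, C, T, V) -- batch, channels, frames, joints
        # adjacency: (num_subsets, V, V) normalized adjacency matrices
        out = 0
        for k, conv in enumerate(self.convs):
            feat = conv(x)  # (N, C_out, T, V)
            # Shift channels per subset so the root node aggregates
            # features drawn from *different* channels of its neighbors
            # (a hypothetical realization of "cross-channel" fusion).
            feat = torch.roll(feat, shifts=k, dims=1)
            out = out + torch.einsum('nctv,vw->nctw', feat, adjacency[k])
        # Channel attention: squeeze over time and joints, excite per channel.
        s = out.mean(dim=(2, 3))  # (N, C_out)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        return out * w[:, :, None, None]
```

For example, `CrossChannelGraphConv(3, 64)` applied to a batch of shape `(N, 3, T, 25)` with three `25 x 25` adjacency matrices yields features of shape `(N, 64, T, 25)`.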

Highlights

  • Human action recognition is still an important and challenging problem in computer vision and provides technical support for downstream applications such as video surveillance, human-machine interaction, video retrieval, and game control

  • To solve the above problems, we propose a novel cross-channel graph convolutional network (CC-GCN), which effectively suppresses the channels unrelated to the action recognition task through a channel attention mechanism (CAM)

  • The feature visualization process is divided into four steps: 1) a video clip is selected and converted into skeleton data through OpenPose; 2) the skeleton data is fed to the trained ST-GCN or BCC-GCN model for feature extraction; 3) the output of the ninth basic module is taken as the feature map in Fig. 6; 4) the feature map is visualized with OpenCV
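
The four steps above can be sketched roughly as follows. This is only an illustration: the helper `extract_skeleton_with_openpose` and the attribute `model.blocks[8]` (standing in for the ninth basic module) are assumed names, not the authors' actual code.

```python
import cv2
import numpy as np
import torch

def visualize_features(video_path, model):
    # Step 1: video -> skeleton sequence via OpenPose.
    # `extract_skeleton_with_openpose` is an assumed helper returning a
    # (C, T, V) array (persons dimension omitted for simplicity).
    skeleton = extract_skeleton_with_openpose(video_path)
    x = torch.from_numpy(skeleton).float().unsqueeze(0)   # add batch dim

    # Steps 2-3: run the trained ST-GCN / BCC-GCN model and capture the
    # output of the ninth basic module with a forward hook
    # (`model.blocks[8]` is an assumed attribute name for that module).
    captured = {}
    handle = model.blocks[8].register_forward_hook(
        lambda module, inputs, output: captured.update(feat=output.detach())
    )
    with torch.no_grad():
        model(x)
    handle.remove()

    # Step 4: average the feature map over channels and render it as a
    # heat map with OpenCV.
    fmap = captured['feat'][0].mean(dim=0).numpy()        # (T', V)
    fmap = 255 * (fmap - fmap.min()) / (np.ptp(fmap) + 1e-8)
    heat = cv2.applyColorMap(fmap.astype(np.uint8), cv2.COLORMAP_JET)
    cv2.imwrite('feature_map.png', heat)
```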


Summary

INTRODUCTION

Human action recognition is still an important and challenging problem in computer vision and provides technical support for downstream applications such as video surveillance, human-machine interaction, video retrieval, and game control. We investigate why previous models may not effectively extract the local features of adjacent vertices in the graph: these methods aggregate the features of all nodes in the neighbor set, within the same feature channel, as the updated feature of the root node, and these features are generated by the same $1 \times 1$ filter. On two large-scale datasets for skeleton-based action recognition, the proposed model achieves superior performance compared to current mainstream methods. Other attention-based methods compute importance weights either for frames or for different feature representations
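
For reference, ST-GCN implements this neighborhood aggregation as a sum over adjacency subsets, where each subset shares one weight matrix realized as a $1 \times 1$ convolution:

$$
\mathbf{f}_{\text{out}} = \sum_{k} \mathbf{\Lambda}_k^{-\frac{1}{2}} \mathbf{A}_k \mathbf{\Lambda}_k^{-\frac{1}{2}} \, \mathbf{f}_{\text{in}} \, \mathbf{W}_k,
$$

where $\mathbf{A}_k$ is the adjacency matrix of the $k$-th subset, $\mathbf{\Lambda}_k$ its degree matrix, and $\mathbf{W}_k$ the weight matrix shared by all nodes. Because channel $c$ of every node is produced by the same filter in $\mathbf{W}_k$, the aggregation for the root node only sums same-channel features of its neighbors, which is the limitation the cross-channel fusion is designed to address.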

BACKGROUND
ST-GCN
LIMITATION ANALYSIS OF PREVIOUS RELATED WORK
ENSEMBLE OF BONES AND JOINTS
CONCLUSION