Abstract

In recent years, graph convolutional networks (GCNs) have achieved remarkable performance in skeleton-based action recognition. In existing works, the features of all nodes in the neighbor set are aggregated into the updated feature of the root node, and these features lie in the same feature channel, determined by the same $1 \times 1$ convolution filter. This may not be optimal for effectively capturing spatial features among adjacent vertices. Moreover, the effect of feature channels that are independent of the current action on model performance is rarely investigated in existing methods. In this paper, we propose cross-channel graph convolutional networks for skeleton-based action recognition. The feature fusion mechanism in our network is cross-channel, i.e., the updated feature of the root node is derived from different feature channels. Because different feature channels come from different $1 \times 1$ convolution filters, the cross-channel fusion mechanism significantly improves the model's ability to capture local features among adjacent vertices. Furthermore, by introducing a channel attention mechanism, we suppress the influence of feature channels unrelated to action recognition, which improves the robustness of the model against feature channels independent of the current action. Extensive experiments on two large-scale datasets, NTU-RGB+D and Kinetics-Skeleton, demonstrate that our model outperforms current mainstream methods.
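
To make the fusion idea concrete, below is a minimal PyTorch sketch of a cross-channel graph convolution layer with channel attention. This is an illustration under our own assumptions, not the authors' implementation: the module name `CrossChannelGraphConv`, the use of `torch.roll` to draw neighbor features from different channels, the `num_subsets` adjacency partitioning, and the SE-style squeeze-and-excitation attention are all hypothetical choices.

```python
import torch
import torch.nn as nn

class CrossChannelGraphConv(nn.Module):
    """Sketch of cross-channel graph convolution with channel attention."""

    def __init__(self, in_channels, out_channels, num_subsets=3):
        super().__init__()
        # One independent 1x1 filter per adjacency subset, so the features
        # fused into the root node come from different filters.
        self.convs = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, kernel_size=1)
            for _ in range(num_subsets)
        )
        # SE-style channel attention (an assumed realization of the CAM):
        # rescales channels so those unrelated to the current action are
        # suppressed.
        self.fc1 = nn.Linear(out_channels, out_channels // 4)
        self.fc2 = nn.Linear(out_channels // 4, out_channels)

    def forward(self, x, adjacency):
        # x: (N, C, T, V) -- batch, channels, frames, joints
        # adjacency: (num_subsets, V, V) normalized adjacency matrices
        out = 0
        for k, conv in enumerate(self.convs):
            feat = conv(x)  # (N, C_out, T, V)
            # Shift channels per subset so the root node aggregates
            # features drawn from *different* channels of its neighbors
            # (a hypothetical realization of "cross-channel" fusion).
            feat = torch.roll(feat, shifts=k, dims=1)
            out = out + torch.einsum('nctv,vw->nctw', feat, adjacency[k])
        # Channel attention: squeeze over time and joints, excite per channel.
        s = out.mean(dim=(2, 3))  # (N, C_out)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        return out * w[:, :, None, None]
```

For example, `CrossChannelGraphConv(3, 64)` applied to a batch of shape `(N, 3, T, 25)` with three `25 x 25` adjacency matrices yields features of shape `(N, 64, T, 25)`.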

Highlights

  • Human action recognition is still an important and challenging problem in computer vision and provides technical support for downstream applications such as video surveillance, human-machine interaction, video retrieval, and game control

  • To solve the above problems, we propose a novel cross-channel graph convolutional network (CC-GCN), which effectively suppresses the channels unrelated to the action recognition task through a channel attention mechanism (CAM)

  • The feature visualization process is divided into four steps: 1) a video clip is selected and converted into skeleton data through OpenPose; 2) the skeleton data is fed to the trained ST-GCN or BCC-GCN model for feature extraction; 3) the output of the ninth basic module is taken as the feature map in Fig. 6; 4) the feature map is visualized with OpenCV
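
The four steps above can be sketched roughly as follows. This is only an illustration: the helper `extract_skeleton_with_openpose` and the attribute `model.blocks[8]` (standing in for the ninth basic module) are assumed names, not the authors' actual code.

```python
import cv2
import numpy as np
import torch

def visualize_features(video_path, model):
    # Step 1: video -> skeleton sequence via OpenPose.
    # `extract_skeleton_with_openpose` is an assumed helper returning a
    # (C, T, V) array (persons dimension omitted for simplicity).
    skeleton = extract_skeleton_with_openpose(video_path)
    x = torch.from_numpy(skeleton).float().unsqueeze(0)   # add batch dim

    # Steps 2-3: run the trained ST-GCN / BCC-GCN model and capture the
    # output of the ninth basic module with a forward hook
    # (`model.blocks[8]` is an assumed attribute name for that module).
    captured = {}
    handle = model.blocks[8].register_forward_hook(
        lambda module, inputs, output: captured.update(feat=output.detach())
    )
    with torch.no_grad():
        model(x)
    handle.remove()

    # Step 4: average the feature map over channels and render it as a
    # heat map with OpenCV.
    fmap = captured['feat'][0].mean(dim=0).numpy()        # (T', V)
    fmap = 255 * (fmap - fmap.min()) / (np.ptp(fmap) + 1e-8)
    heat = cv2.applyColorMap(fmap.astype(np.uint8), cv2.COLORMAP_JET)
    cv2.imwrite('feature_map.png', heat)
```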


Summary

INTRODUCTION

Human action recognition is still an important and challenging problem in computer vision and provides technical support for downstream applications such as video surveillance, human-machine interaction, video retrieval, and game control. We investigate why previous models may not effectively extract the local features of adjacent vertices in the graph: these methods aggregate the features of all nodes in the neighbor set, within the same feature channel, as the updated feature of the root node, and these features are generated by the same $1 \times 1$ filter. On two large-scale datasets for skeleton-based action recognition, the proposed model achieves superior performance compared to current mainstream methods. Other attention-based methods compute importance weights either for frames or for different feature representations
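
For reference, ST-GCN implements this neighborhood aggregation as a sum over adjacency subsets, where each subset shares one weight matrix realized as a $1 \times 1$ convolution:

$$
\mathbf{f}_{\text{out}} = \sum_{k} \mathbf{\Lambda}_k^{-\frac{1}{2}} \mathbf{A}_k \mathbf{\Lambda}_k^{-\frac{1}{2}} \, \mathbf{f}_{\text{in}} \, \mathbf{W}_k,
$$

where $\mathbf{A}_k$ is the adjacency matrix of the $k$-th subset, $\mathbf{\Lambda}_k$ its degree matrix, and $\mathbf{W}_k$ the weight matrix shared by all nodes. Because channel $c$ of every node is produced by the same filter in $\mathbf{W}_k$, the aggregation for the root node only sums same-channel features of its neighbors, which is the limitation the cross-channel fusion is designed to address.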

BACKGROUND
ST-GCN
LIMITATION ANALYSIS OF PREVIOUS RELATED WORK
ENSEMBLE OF BONES AND JOINTS
CONCLUSION