Abstract

Skeleton-based human action recognition has attracted extensive attention in the field of computer vision due to the robustness of human skeleton data. In recent years, there has been a trend of using graph convolutional networks (GCNs) to model the human skeleton as a spatio-temporal graph and explore the internal connections between human joints, which has achieved remarkable performance. However, existing methods often ignore long-range dependencies between joints, and fixed temporal convolution kernels lead to inflexible temporal modeling. In this paper, we propose a multi-scale adaptive aggregate graph convolutional network (MSAAGCN) for skeleton-based action recognition. First, we design a multi-scale spatial GCN to aggregate long-range and multi-order semantic information from the skeleton data and comprehensively model the internal relations of the human body for feature learning. Then, a multi-scale temporal module adaptively selects convolution kernels of different temporal lengths to obtain more flexible temporal modeling. Additionally, an attention mechanism is added to capture the most meaningful joint, frame and channel information in the skeleton sequence. Extensive experiments on three large-scale datasets (NTU RGB+D 60, NTU RGB+D 120 and Kinetics-Skeleton) demonstrate the superiority of the proposed MSAAGCN.
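
As a concrete illustration of the temporal branch described above, the following is a minimal PyTorch sketch (not the authors' implementation) of a multi-scale temporal convolution that adaptively weights branches with different kernel lengths; the kernel sizes (3, 5, 7) and the softmax-gated fusion are illustrative assumptions.

```python
# Minimal sketch of a multi-scale adaptive temporal convolution:
# several temporal kernel lengths, fused with learnable softmax weights.
import torch
import torch.nn as nn


class MultiScaleTemporalConv(nn.Module):
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One temporal conv per scale; input is (N, C, T, V) = batch, channels, frames, joints.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=(k, 1), padding=(k // 2, 0))
            for k in kernel_sizes
        )
        # Learnable per-branch logits, normalized by softmax at run time.
        self.branch_logits = nn.Parameter(torch.zeros(len(kernel_sizes)))

    def forward(self, x):
        weights = torch.softmax(self.branch_logits, dim=0)
        return sum(w * branch(x) for w, branch in zip(weights, self.branches))


if __name__ == "__main__":
    x = torch.randn(2, 64, 300, 25)             # 2 clips, 64 channels, 300 frames, 25 joints
    print(MultiScaleTemporalConv(64)(x).shape)  # torch.Size([2, 64, 300, 25])
```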

Highlights

  • With the development of Internet technology and the popularization of video acquisition equipment, video has become the main carrier of information

  • We propose a novel architecture for skeleton-based action recognition; the basic structure of our proposed model is the spatio-temporal-attention module (STAM), which is composed of a multi-scale aggregate graph convolution network (MSAGCN) module, a multi-scale adaptive temporal convolution network (MSATCN) module and a spatio-temporal-channel attention (STCAtt) module, as shown in Figure 1; an illustrative sketch of the attention component follows this list
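
To make the STCAtt component more concrete, the sketch below shows one plausible way to gate joints, frames and channels of an (N, C, T, V) skeleton tensor; the pooling-plus-sigmoid gating is an assumption for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a spatio-temporal-channel attention block in the spirit of
# the STCAtt module described above (axis gating details are assumptions).
import torch
import torch.nn as nn


class STCAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Channel gate: squeeze over frames and joints, excite channels.
        self.channel_gate = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(), nn.Linear(hidden, channels), nn.Sigmoid()
        )
        # Joint and frame gates: 1x1 convolutions on axis-averaged feature maps.
        self.joint_gate = nn.Sequential(nn.Conv1d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.frame_gate = nn.Sequential(nn.Conv1d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):                        # x: (N, C, T, V)
        n, c, t, v = x.shape
        w_c = self.channel_gate(x.mean(dim=(2, 3)))   # (N, C) channel weights
        x = x * w_c.view(n, c, 1, 1)
        w_v = self.joint_gate(x.mean(dim=2))          # (N, 1, V) joint weights
        x = x * w_v.view(n, 1, 1, v)
        w_t = self.frame_gate(x.mean(dim=3))          # (N, 1, T) frame weights
        return x * w_t.view(n, 1, t, 1)
```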


Summary

Introduction

With the development of Internet technology and the popularization of video acquisition equipment, video has become the main carrier of information. We propose a novel architecture for skeleton-based action recognition; the basic structure of our proposed model is the spatio-temporal-attention module (STAM), which is composed of a multi-scale aggregate graph convolution network (MSAGCN) module, a multi-scale adaptive temporal convolution network (MSATCN) module and a spatio-temporal-channel attention (STCAtt) module, as shown in Figure 1. In such recurrent variants, the neurons in the same layer are independent of each other and are connected across layers; this alleviates the common gradient vanishing and explosion problems of traditional RNNs and LSTMs, handles longer sequences, and allows deeper networks that learn long-term dependencies in the skeleton sequence. However, both CNN-based and RNN-based methods ignore the co-occurrence of spatial and temporal features in the skeleton sequence; moreover, because skeleton data is embedded in a non-Euclidean geometric space, these methods cannot handle it well. Yan et al. [13] first proposed a spatio-temporal graph convolutional network that directly models the skeleton sequence as a graph structure and applies graph convolutions to extract spatio-temporal features for action recognition, achieving better performance than previous methods.
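
For readers unfamiliar with ST-GCN-style models, the following minimal sketch (illustrative, not Yan et al.'s released code) shows a single spatial graph convolution over a skeleton graph: each joint aggregates features from its neighbours through a normalized adjacency matrix before a 1x1 channel projection.

```python
# Minimal sketch of one spatial graph convolution on a skeleton graph.
import torch
import torch.nn as nn


class SpatialGraphConv(nn.Module):
    def __init__(self, in_channels, out_channels, adjacency):
        super().__init__()
        # Symmetrically normalize A + I once: D^-1/2 (A + I) D^-1/2.
        a_hat = adjacency + torch.eye(adjacency.size(0))
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        self.register_buffer("a_norm", d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :])
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (N, C, T, V); aggregate over joints V, then project channels.
        x = torch.einsum("nctv,vw->nctw", x, self.a_norm)
        return self.proj(x)


if __name__ == "__main__":
    V = 25                                  # e.g. NTU RGB+D joint count
    A = torch.zeros(V, V)                   # fill with the skeleton's bone connections
    layer = SpatialGraphConv(3, 64, A)
    print(layer(torch.randn(2, 3, 300, V)).shape)  # torch.Size([2, 64, 300, 25])
```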

Graph Convolutional Network and Attention Mechanism
Multi-Scale Adaptive Aggregate Graph Convolutional Network
Multi-Scale Aggregate Graph Convolution Module
Multi-Scale Adaptive Temporal Convolution Module
Spatial-Temporal-Channel Attention Module
Multi-Stream Framework
Network Architecture
Implementation Details
MSAAGCN
Methods
Attention Mechanism
Comparison with State-of-the-Art Methods
Findings
Discussion
