Abstract

To address the problems of irrelevant frames and high model complexity in action recognition, we propose a Spatial–Temporal Hypergraph based on Dual-Stage Attention Network (STHG-DAN) for lightweight action recognition on multi-view data. It comprises two stages: a Temporal Attention Mechanism based on Trainable Threshold (TAM-TT) and Hypergraph Convolution based on a Dynamic Spatial–Temporal Attention Mechanism (HG-DSTAM). In the first stage, TAM-TT uses a trainable threshold to extract keyframes from multi-view videos, where the multiple views ensure that more comprehensive information is available to the subsequent stage. In the second stage, HG-DSTAM divides the human joints into three parts (trunk, hands, and legs) to build spatial–temporal hypergraphs, extracts high-order features from the hypergraphs constructed over the multi-view body joints, and feeds them into a dynamic spatial–temporal attention mechanism that learns intra-frame correlations between the joint features of body parts across views, thereby locating the salient regions of an action. Multi-scale convolutions and depthwise separable convolutions allow the network to perform efficient action recognition with few trainable parameters. Experiments on the NTU-RGB+D, NTU-RGB+D 120, and imitating traffic police gesture datasets show that the model outperforms existing algorithms in both efficiency and accuracy, effectively improving machines' ability to interpret human body language.
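The trainable-threshold keyframe selection in the first stage can be pictured with a minimal sketch. The module name, layer sizes, and the sigmoid relaxation of the hard keep/drop decision below are illustrative assumptions, not the paper's exact design; the sketch only shows how a threshold can be made a learnable parameter so that frame selection is trained end to end:

```python
# Hypothetical sketch of a trainable-threshold frame gate (not the paper's
# exact TAM-TT architecture): per-frame attention scores are compared
# against a learnable threshold, relaxed with a sigmoid so the whole
# selection stays differentiable.
import torch
import torch.nn as nn

class TrainableThresholdFrameGate(nn.Module):
    def __init__(self, feat_dim: int, sharpness: float = 10.0):
        super().__init__()
        # Small scoring head producing one attention score per frame.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.ReLU(),
            nn.Linear(feat_dim // 2, 1),
        )
        # Learnable threshold, updated by backpropagation with the network.
        self.threshold = nn.Parameter(torch.tensor(0.5))
        self.sharpness = sharpness  # steepness of the soft gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) per-frame features from one view.
        scores = torch.sigmoid(self.score(x)).squeeze(-1)      # (B, T)
        # Soft version of "keep the frame if score > threshold".
        gate = torch.sigmoid(self.sharpness * (scores - self.threshold))
        return x * gate.unsqueeze(-1)                          # gated frames

if __name__ == "__main__":
    frames = torch.randn(2, 32, 256)  # 2 clips, 32 frames, 256-d features
    gated = TrainableThresholdFrameGate(256)(frames)
    print(gated.shape)  # torch.Size([2, 32, 256])
```

The sigmoid gate is one common way to keep a hard thresholding rule differentiable; at inference time it could be replaced by the hard comparison to actually drop frames below the learned threshold.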
