Abstract

Dynamic gestures are widely used in automated intelligent manufacturing. Because dynamic gesture data are complex in both time and space, traditional machine learning algorithms struggle to extract accurate gesture features. Existing dynamic gesture recognition algorithms rely on complex network designs with high parameter counts, yet still extract gesture features inadequately. To address the low accuracy and high computational complexity of current dynamic gesture recognition, a network model based on the MetaFormer architecture and an attention mechanism is proposed. The network fuses a CNN (convolutional neural network) with a Transformer by embedding spatial attention convolution and temporal attention convolution into the Transformer model. Specifically, the token mixer in the MetaFormer block is replaced by a Spatial Attention Convolution Block and a Temporal Attention Convolution Block, yielding the Spatial Attention Former Block and the Temporal Attention Former Block, respectively. First, each frame of the input is quickly down-sampled by a PoolFormer block and then passed to the Spatial Attention Former Block, which learns spatial feature information. Next, the spatial feature maps learned from the individual frames are concatenated along the channel dimension and passed to the Temporal Attention Former Block, which learns the temporal feature information of the gesture action. Finally, the learned overall feature information is classified to obtain the dynamic gesture category. The model achieves average recognition accuracies of 96.72% on Jester and 92.16% on NVGesture, two publicly available datasets.
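
The block structure described in the abstract can be illustrated with a minimal, dependency-free sketch. It shows only the general MetaFormer pattern (a normalized token mixer with a residual connection, followed by a normalized channel MLP with a residual connection) and the channel-wise concatenation of per-frame features. Everything else is an assumption made for illustration: the function names are hypothetical, the PoolFormer-style pooling mixer stands in for both the Spatial and the Temporal Attention Convolution Blocks (whose internals the abstract does not specify), the channel MLP has no learned weights, and the initial down-sampling stage and the classifier are omitted.

```python
import math
from typing import Callable, List

Tokens = List[List[float]]           # layout: [num_tokens][channels]
Mixer = Callable[[Tokens], Tokens]


def layer_norm(x: Tokens, eps: float = 1e-5) -> Tokens:
    """Normalize each token over its channels (no learned scale or shift)."""
    out = []
    for tok in x:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / math.sqrt(var + eps) for v in tok])
    return out


def add(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise sum, used for the residual connections."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def pool_mixer(x: Tokens) -> Tokens:
    """PoolFormer-style mixer: average over a 3-token window, minus the input.

    Placeholder only: the paper swaps in attention convolution blocks here.
    """
    out = []
    for i in range(len(x)):
        window = x[max(0, i - 1): i + 2]   # shorter window at the borders
        avg = [sum(col) / len(window) for col in zip(*window)]
        out.append([a - v for a, v in zip(avg, x[i])])
    return out


def relu_mlp(x: Tokens) -> Tokens:
    """Toy stand-in for the channel MLP (assumption: no learned weights)."""
    return [[max(0.0, v) for v in tok] for tok in x]


def metaformer_block(x: Tokens, token_mixer: Mixer,
                     channel_mlp: Mixer = relu_mlp) -> Tokens:
    """Generic MetaFormer block with a pluggable token mixer."""
    x = add(x, token_mixer(layer_norm(x)))   # token-mixing sub-block
    x = add(x, channel_mlp(layer_norm(x)))   # channel-MLP sub-block
    return x


def concat_channels(per_frame: List[Tokens]) -> Tokens:
    """Concatenate per-frame feature maps along the channel dimension."""
    num_tokens = len(per_frame[0])
    return [[v for frame in per_frame for v in frame[i]]
            for i in range(num_tokens)]


def extract_features(frames: List[Tokens],
                     spatial_mixer: Mixer = pool_mixer,
                     temporal_mixer: Mixer = pool_mixer) -> Tokens:
    """Per-frame spatial block, channel concatenation, then temporal block."""
    spatial = [metaformer_block(f, spatial_mixer) for f in frames]
    fused = concat_channels(spatial)
    return metaformer_block(fused, temporal_mixer)


# Toy input: 4 frames, each with 6 tokens of 3 channels.
frames = [[[float(t + n + c) for c in range(3)] for n in range(6)]
          for t in range(4)]
features = extract_features(frames)
print(len(features), len(features[0]))  # 6 12
```

Because the token mixer is passed in as an argument, replacing `pool_mixer` with a different mixing function changes the block's behaviour without touching the residual and normalization structure, which is the property of MetaFormer that the abstract relies on. After concatenation, each token carries the channels of all frames (here 4 × 3 = 12), so the second block sees information from the whole clip at once.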
