Hand gesture recognition is essential to human-computer interaction as the most natural way of communicating. Furthermore, with the development of 3D hand pose estimation technology and the performance improvement of low-cost depth cameras, skeleton-based dynamic hand gesture recognition has received much attention. This paper proposes a novel multi-stream improved spatio-temporal graph convolutional network (MS-ISTGCN) for skeleton-based dynamic hand gesture recognition. We adopt an adaptive spatial graph convolution that can learn the relationship between distant hand joints and propose an extended temporal graph convolution with multiple dilation rates that can extract informative temporal features from short to long periods. Furthermore, we add a new attention layer consisting of effective spatio-temporal attention and channel attention between the spatial and temporal graph convolution layers to find and focus on key features. Finally, we propose a multi-stream structure that feeds multiple data modalities (i.e., joints, bones, and motions) as inputs to improve performance using the ensemble technique. Each of the three-stream networks is independently trained and fused to predict the final hand gesture. The performance of the proposed method is verified through extensive experiments with two widely used public dynamic hand gesture datasets: SHREC’17 Track and DHG-14/28. Our proposed method achieves the highest recognition accuracy in various gesture categories for both datasets compared with state-of-the-art methods.
Read full abstract