Abstract

With the development of cost-effective depth sensors, skeleton-based dynamic hand gesture recognition has made significant progress. Existing methods mostly rely on a single model to learn all spatial–temporal features; moreover, they can neither effectively boost key features nor exploit multi-scale features. In this paper, we propose a lightweight dual-stream framework, which consists of a temporal mutual boosted stream (TMB-Stream) and a spatial self-boosted stream (SSB-Stream). In the TMB-Stream, we design a hybrid attention module (HAM) to boost important motion features in temporal sequences; it is composed of a multi-scale multi-head attention module (MMAM) and a spatial–temporal attention module (STAM). In the SSB-Stream, we present a self-boosted learning scheme to improve the performance of the spatial stream. Specifically, we design a multi-scale auto-encoder (MAE) that uses limited skeleton data to extract and boost spatial latent features by minimizing the gap between the original and reconstructed skeleton images. In addition, we propose a multi-scale fusion module (MFM) to fuse multi-scale features effectively in stages. Experimental results show that our lightweight framework achieves satisfactory performance on the SHREC’17 Track and DHG-14/28 datasets, as well as highly competitive performance on the FPHA dataset.
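The self-boosted objective of the SSB-Stream described above can be illustrated with a minimal sketch: an auto-encoder reconstructs a flattened "skeleton image" and is trained to minimize the gap between the original and the reconstruction, so its latent code carries boosted spatial features without extra labels. All names, shapes, and the single-scale linear architecture below are illustrative assumptions, not the paper's implementation (which is multi-scale).

```python
import numpy as np

rng = np.random.default_rng(0)

def init_autoencoder(in_dim, latent_dim):
    """Hypothetical single-scale linear auto-encoder (the paper's MAE is multi-scale)."""
    scale = 1.0 / np.sqrt(in_dim)
    return {
        "W_enc": rng.uniform(-scale, scale, (in_dim, latent_dim)),
        "W_dec": rng.uniform(-scale, scale, (latent_dim, in_dim)),
    }

def reconstruction_loss(params, x):
    """Mean-squared gap between original and reconstructed skeleton image."""
    z = x @ params["W_enc"]       # latent spatial features
    x_hat = z @ params["W_dec"]   # reconstructed skeleton image
    return float(np.mean((x - x_hat) ** 2)), z

# Toy skeleton image: 22 hand joints x 3 coordinates, flattened for one frame
# (joint count assumed for illustration only).
x = rng.standard_normal((1, 66))
params = init_autoencoder(in_dim=66, latent_dim=16)
loss, latent = reconstruction_loss(params, x)
print(loss, latent.shape)
```

Minimizing this loss (e.g., by gradient descent on `W_enc` and `W_dec`) drives the latent code `z` to retain the spatial structure of the skeleton, which is the sense in which the spatial stream is "self-boosted".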
