Abstract

Video‐based hand gesture recognition plays an important role in human‐computer interaction (HCI). Recent advanced methods usually adopt 3D convolutional neural networks to capture information from both the spatial and temporal dimensions. However, these methods require large‐scale training data and incur high computational complexity. To address this issue, we propose an attentive feature fusion framework for efficient hand‐gesture recognition. In our model, a shallow two‐stream CNN captures low‐level features from the original video frames and their corresponding optical flow. We then design an attentive feature fusion module that selectively combines useful information from the two streams based on an attention mechanism. Finally, we obtain a compact video embedding by concatenating the fused features of several short segments. To evaluate the effectiveness of the proposed framework, we train and test our method on Jester, a large‐scale video‐based hand gesture recognition dataset. Experimental results demonstrate that our approach achieves very competitive performance on Jester, with a classification accuracy of 95.77%.
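
To make the fusion step concrete, the sketch below shows one way such an attentive two‐stream fusion could look in PyTorch. The gating network, its layer sizes, and the channel‐wise softmax weighting are illustrative assumptions; the abstract does not specify the module's exact design.

```python
# A minimal sketch of attention-based two-stream feature fusion.
# Assumes PyTorch and per-channel attention weights over the two streams;
# this is NOT the paper's exact module, which the abstract does not detail.
import torch
import torch.nn as nn


class AttentiveFusion(nn.Module):
    """Fuse RGB and optical-flow features with learned attention weights."""

    def __init__(self, channels: int):
        super().__init__()
        # Small gating network that scores each stream per channel
        # (a hypothetical design choice, not taken from the paper).
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, 2 * channels),
        )

    def forward(self, rgb_feat: torch.Tensor, flow_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat, flow_feat: (batch, channels) pooled segment features.
        stacked = torch.cat([rgb_feat, flow_feat], dim=1)          # (B, 2C)
        scores = self.gate(stacked).view(-1, 2, rgb_feat.size(1))  # (B, 2, C)
        weights = torch.softmax(scores, dim=1)  # normalize over the two streams
        fused = weights[:, 0] * rgb_feat + weights[:, 1] * flow_feat
        return fused                                               # (B, C)


# Usage: fuse per-segment features, then concatenate segments into a
# compact video embedding, mirroring the pipeline the abstract describes.
if __name__ == "__main__":
    batch, segments, channels = 4, 8, 256
    fusion = AttentiveFusion(channels)
    segment_embeddings = []
    for _ in range(segments):
        rgb = torch.randn(batch, channels)   # stand-in for RGB-stream CNN output
        flow = torch.randn(batch, channels)  # stand-in for flow-stream CNN output
        segment_embeddings.append(fusion(rgb, flow))
    video_embedding = torch.cat(segment_embeddings, dim=1)  # (B, segments * C)
    print(video_embedding.shape)  # torch.Size([4, 2048])
```

Because the attention weights are computed per channel and normalized across the two streams, the module can emphasize the appearance stream or the motion stream differently for each feature dimension, which is the selective combination the abstract refers to.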
