Abstract

Human action recognition has been studied extensively, whereas relatively few methods have been proposed for hand action recognition. Although it is natural and straightforward to apply a human action recognition method to hand actions, this approach does not always lead to state-of-the-art performance. One important reason is that both the between-class and the within-class differences in hand actions are much smaller than those in human actions. In this article, we study first-person hand action recognition from RGB-D sequences. To explore whether the choice of pretrained network substantially influences accuracy, we use eight classic pretrained networks and one pretrained network of our own design to extract RGB-D features. Hand pose is represented using a Lie group. Ablation studies compare the discriminative power of the RGB modality, the depth modality, the pose modality, and their combinations. In our method, a fixed number of frames is randomly sampled to represent an action. This temporal modeling strategy is simple but proves more effective than both the graph convolutional network (GCN) and the recurrent neural network (RNN), which are widely adopted by conventional methods. Evaluation experiments on two public data sets demonstrate that our method markedly outperforms recent baselines.
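
The temporal modeling strategy described above, in which a fixed number of randomly sampled frames represents an action, can be illustrated with a minimal sketch. The abstract does not specify the sampling details, so the following are assumptions made only for illustration: the function name `sample_frame_indices`, returning the indices in temporal order, sampling without replacement, and falling back to sampling with replacement when a sequence is shorter than the requested number of frames.

```python
import random


def sample_frame_indices(num_frames, num_samples, rng=None):
    """Randomly pick `num_samples` frame indices from a sequence of `num_frames`.

    Hypothetical helper, not the authors' implementation. Indices are
    returned sorted so the sampled frames keep their temporal order
    (an assumption; the abstract does not state this).
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    rng = rng or random.Random()
    if num_frames >= num_samples:
        # Enough frames: sample distinct indices without replacement.
        indices = rng.sample(range(num_frames), num_samples)
    else:
        # Assumed fallback for short sequences: allow repeated frames.
        indices = [rng.randrange(num_frames) for _ in range(num_samples)]
    return sorted(indices)


# Example: represent a 100-frame RGB-D sequence by 16 sampled frames.
indices = sample_frame_indices(100, 16, random.Random(0))
```

Because every action is reduced to the same number of frames, the per-frame features can be stacked into a fixed-size input regardless of the original sequence length.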

