Abstract

In this paper, we present a unified framework for understanding hand actions from first-person video. The proposed framework comprises two main components. The first component estimates three-dimensional (3D) hand joints from RGB images. Two network structures derived from the baseline HopeNet are proposed: a traditional multi-layer convolutional neural network (CNN) and a CNN combined with a Graph CNN, both performing 3D hand pose estimation without the GraphUNet used in the baseline HopeNet. The second component recognizes hand actions from the skeleton stream. We first deploy two recent advanced neural networks, PA-ResGCN and the Double-feature Double-motion network (DDNet). To focus more on hand pose changes, we improve DDNet with two hand-joint normalization strategies. Finally, we fuse PA-ResGCN with our improved DDNet to further boost recognition performance. We evaluate the proposed methods on the First-Person Hand Action Benchmark dataset. Experiments show that our model for 3D hand joint estimation achieves the best precision (36.6 mm). Our hand-joint normalization strategies improve the original DDNet by 0.71% to 4.05% in accuracy with ground-truth hand poses, while the improvement is significantly larger (from 2.96% to 10.98%) with estimated hand poses. The late fusion schemes outperform various state-of-the-art methods for hand action recognition, with the highest accuracy of 86.67%. These experimental results show the potential and extensibility of the framework for developing practical first-person vision applications.
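To make the two recognition-side ideas concrete, the sketch below (plain NumPy, not the authors' implementation) shows one plausible way to (i) normalize hand joints so the recognizer focuses on pose changes rather than absolute hand position or scale, and (ii) late-fuse the per-class scores of two skeleton-based models such as DDNet and PA-ResGCN. The joint indices, reference bone, and fusion weight are illustrative assumptions only.

    import numpy as np

    def normalize_hand_joints(joints):
        """joints: (T, 21, 3) array of 3D hand joints over T frames."""
        wrist = joints[:, 0:1, :]                 # assume joint 0 is the wrist
        centred = joints - wrist                  # remove global hand position
        # scale by a reference bone length (wrist -> assumed middle-finger MCP, index 9)
        ref = np.linalg.norm(centred[:, 9, :], axis=-1, keepdims=True) + 1e-8
        return centred / ref[:, None, :]          # remove global hand scale

    def late_fuse(scores_a, scores_b, alpha=0.5):
        """Weighted average of per-class softmax scores from two models."""
        return alpha * scores_a + (1.0 - alpha) * scores_b

    # toy usage with random data
    T, J, C = 32, 21, 45                          # frames, joints, action classes
    pose = normalize_hand_joints(np.random.randn(T, J, 3))
    p_ddnet = np.random.dirichlet(np.ones(C))
    p_resgcn = np.random.dirichlet(np.ones(C))
    predicted_class = int(np.argmax(late_fuse(p_ddnet, p_resgcn)))

In practice, the fusion weight would be selected on a validation split rather than fixed at 0.5.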
