Abstract

In this paper, we present a unified framework for understanding hand actions from first-person video. The proposed framework comprises two main components. The first component estimates three-dimensional (3D) hand joints from RGB images. Two network structures derived from the baseline HopeNet are proposed: a traditional multi-layer convolutional neural network (CNN) and a CNN combined with a Graph CNN, both performing 3D hand pose estimation without the GraphUNet used in the baseline HopeNet. The second component recognizes hand actions from the skeleton stream. We first deploy two recent advanced neural networks, PA-ResGCN and the Double-feature Double-motion network (DDNet). To focus more on hand pose changes, we improve DDNet with two hand-joint normalization strategies. Finally, we fuse PA-ResGCN with our improved DDNet to further boost recognition performance. We evaluate the proposed methods on the First-Person Hand Action Benchmark dataset. Experiments show that our model for 3D hand joint estimation achieves the best precision (36.6 mm). Our hand-joint normalization strategies improve the original DDNet by 0.71% to 4.05% in accuracy with ground-truth hand poses, while the improvement is significantly larger (from 2.96% to 10.98%) with estimated hand poses. The late fusion schemes outperform various state-of-the-art methods for hand action recognition, with the highest accuracy of 86.67%. These experimental results show the potential and extensibility of the framework for developing practical first-person vision applications.
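To make the two recognition-side ideas concrete, the sketch below (plain NumPy, not the authors' implementation) shows one plausible way to (i) normalize hand joints so the recognizer focuses on pose changes rather than absolute hand position or scale, and (ii) late-fuse the per-class scores of two skeleton-based models such as DDNet and PA-ResGCN. The joint indices, reference bone, and fusion weight are illustrative assumptions only.

    import numpy as np

    def normalize_hand_joints(joints):
        """joints: (T, 21, 3) array of 3D hand joints over T frames."""
        wrist = joints[:, 0:1, :]                 # assume joint 0 is the wrist
        centred = joints - wrist                  # remove global hand position
        # scale by a reference bone length (wrist -> assumed middle-finger MCP, index 9)
        ref = np.linalg.norm(centred[:, 9, :], axis=-1, keepdims=True) + 1e-8
        return centred / ref[:, None, :]          # remove global hand scale

    def late_fuse(scores_a, scores_b, alpha=0.5):
        """Weighted average of per-class softmax scores from two models."""
        return alpha * scores_a + (1.0 - alpha) * scores_b

    # toy usage with random data
    T, J, C = 32, 21, 45                          # frames, joints, action classes
    pose = normalize_hand_joints(np.random.randn(T, J, 3))
    p_ddnet = np.random.dirichlet(np.ones(C))
    p_resgcn = np.random.dirichlet(np.ones(C))
    predicted_class = int(np.argmax(late_fuse(p_ddnet, p_resgcn)))

In practice, the fusion weight would be selected on a validation split rather than fixed at 0.5.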
