Abstract

Hand pose estimation is the task of predicting the position and orientation of the hand and fingers relative to some coordinate system. It is an important task, and an important input, for applications in robotics, medicine, and human-computer interaction. In recent years, the success of deep convolutional neural networks and the popularity of low-cost consumer wearable cameras have made hand pose estimation on egocentric images using deep neural networks a hot topic in computer vision. This paper proposes a novel deep model for accurate 2D hand pose estimation that combines HOPE-Net, which estimates the hand pose, with Mask R-CNN, which provides hand detection and segmentation to localize the hand in the image. First, HOPE-Net predicts an initial 2D hand pose, and hand features are extracted from a hand-centered image cropped from the original image based on Mask R-CNN's output. Then, the initial 2D hand pose and the hand features are combined and fed into a fully connected layer to predict a more accurate 2D hand pose. Our experiments show that the proposed model outperforms the original HOPE-Net in 2D hand pose estimation. On the First-Person Hand Action dataset, the proposed method achieves a mean endpoint error (mEPE) of 48.82 pixels, compared with 86.30 pixels for the 2D predictor of HOPE-Net.
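
The refinement step described above can be summarized in a compact sketch. The following PyTorch snippet is not the authors' released code; the 21-joint hand skeleton, the feature dimensionality, and the `PoseRefiner`/`mean_epe` names are illustrative assumptions. It shows how an initial 2D pose and features from the cropped hand image might be concatenated and passed through a single fully connected layer, together with the mEPE metric used for evaluation.

```python
# Minimal sketch of the pose-refinement idea, assuming a 21-joint hand
# skeleton and a 512-dimensional hand feature vector (both assumptions).
import torch
import torch.nn as nn

NUM_JOINTS = 21      # assumption: standard 21-keypoint hand skeleton
FEATURE_DIM = 512    # assumption: size of the cropped-hand feature vector

class PoseRefiner(nn.Module):
    """Fuses an initial 2D pose with hand-crop features to refine the pose."""
    def __init__(self, num_joints=NUM_JOINTS, feature_dim=FEATURE_DIM):
        super().__init__()
        # Input: flattened (x, y) per joint plus the hand feature vector.
        self.fc = nn.Linear(num_joints * 2 + feature_dim, num_joints * 2)

    def forward(self, initial_pose, hand_features):
        # initial_pose: (B, num_joints, 2); hand_features: (B, feature_dim)
        fused = torch.cat([initial_pose.flatten(1), hand_features], dim=1)
        return self.fc(fused).view(-1, initial_pose.shape[1], 2)

def mean_epe(pred, gt):
    """Mean endpoint error: average Euclidean joint distance, in pixels."""
    return torch.linalg.norm(pred - gt, dim=-1).mean()

# Usage with random tensors standing in for the two networks' outputs:
refiner = PoseRefiner()
pose0 = torch.randn(4, NUM_JOINTS, 2)   # initial HOPE-Net 2D pose
feats = torch.randn(4, FEATURE_DIM)     # features from the Mask R-CNN crop
refined = refiner(pose0, feats)         # (4, 21, 2) refined 2D pose
print(mean_epe(refined, torch.randn(4, NUM_JOINTS, 2)))
```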
