Abstract

In the realm of Human-Computer Interaction (HCI), the importance of hands cannot be overstated. Hands serve as a fundamental means of communication, expression, and interaction in the physical world. In recent years, Augmented Reality (AR) has emerged as a next-generation technology that seamlessly merges the digital and physical worlds, providing transformative experiences across various domains. In this context, accurate hand pose and shape estimation plays a crucial role in enabling natural and intuitive interactions within AR environments. By overlaying digital information onto the real world, AR has the potential to revolutionize how we interact with technology; from gaming and education to healthcare and industrial training, it has opened up new possibilities for enhancing user experiences. This study proposes an innovative approach for hand pose and shape estimation in AR applications. The methodology begins with a pre-trained Single Shot MultiBox Detector (SSD) for hand detection and cropping. The cropped hand image is then converted to the HSV color model, and histogram equalization is applied to the value channel. To isolate the hand, specific bounds are set for each band of the HSV color space, generating a mask. To refine the mask and reduce noise, contouring and gap-filling techniques are applied. The refined mask is then combined with the original cropped image through a logical AND operation to accurately delineate the hand boundaries, ensuring robust hand detection even in complex scenes. To extract pertinent features, the detected hand undergoes two concurrent processes. First, the Scale-Invariant Feature Transform (SIFT) algorithm identifies distinctive keypoints on the hand's outer surface. Simultaneously, a pre-trained lightweight Convolutional Neural Network (CNN), namely MobileNet, extracts 3D hand landmarks, the hand's center (the middle-finger metacarpophalangeal joint), and handedness information. These features, comprising hand keypoints, landmarks, center, and handedness, are aggregated into a CSV file for further analysis. A Gated Recurrent Unit (GRU) network then processes the features, capturing intricate dependencies between them, and predicts the 3D hand pose with high accuracy even in dynamic scenarios. The evaluation results are promising: the Mean Per Joint Position Error in 3D (MPJPE) between the predicted pose and the ground-truth hand landmarks is 0.0596, and the Percentage of Correct Keypoints (PCK) is 95%. Once the hand pose is predicted, a mesh representation is used to reconstruct the 3D shape of the hand, providing a tangible representation of the hand's structure and orientation and enhancing the realism and usability of the AR application. By integrating sophisticated detection, feature extraction, and predictive modeling techniques, this method contributes to more immersive and intuitive AR experiences, fostering the seamless fusion of the digital and physical worlds.
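
As a rough illustration of the segmentation stage described in the abstract, the sketch below converts a cropped detection to HSV, equalizes the value channel, thresholds each band, refines the mask via contouring and gap filling, and combines it with the crop through a logical AND. The specific HSV bounds, kernel size, and helper name are assumptions; the abstract does not give the exact values or implementation.

```python
import cv2
import numpy as np

def segment_hand(cropped_bgr,
                 lower_hsv=(0, 30, 60), upper_hsv=(25, 180, 255)):
    """Isolate the hand in a cropped SSD detection (illustrative bounds only)."""
    # Convert to HSV and equalize the value (brightness) channel
    hsv = cv2.cvtColor(cropped_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    v = cv2.equalizeHist(v)
    hsv = cv2.merge((h, s, v))

    # Threshold each HSV band to build an initial hand mask
    mask = cv2.inRange(hsv,
                       np.array(lower_hsv, dtype=np.uint8),
                       np.array(upper_hsv, dtype=np.uint8))

    # Refine the mask: keep the largest contour and fill gaps inside it
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    refined = np.zeros_like(mask)
    if contours:
        largest = max(contours, key=cv2.contourArea)
        cv2.drawContours(refined, [largest], -1, 255, thickness=cv2.FILLED)
    refined = cv2.morphologyEx(refined, cv2.MORPH_CLOSE,
                               np.ones((7, 7), np.uint8))

    # Logical AND with the original crop to delineate the hand boundaries
    return cv2.bitwise_and(cropped_bgr, cropped_bgr, mask=refined)
```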
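
The two concurrent feature-extraction steps could look roughly like the following. SIFT keypoints are computed with OpenCV, while MediaPipe Hands is used here purely as a stand-in for the paper's MobileNet-based landmark network (the authors' exact model and CSV layout are not specified in the abstract, so the row format below is an assumption).

```python
import csv
import cv2
import mediapipe as mp

def extract_features(hand_bgr, csv_path="hand_features.csv"):
    """Collect SIFT keypoints, 3D landmarks, hand center, and handedness."""
    # SIFT keypoints on the segmented hand surface
    gray = cv2.cvtColor(hand_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    kp_xy = [kp.pt for kp in sift.detect(gray, None)]

    # 3D landmarks, hand center, and handedness from a lightweight CNN
    # (MediaPipe Hands used as an assumed stand-in for the paper's model)
    with mp.solutions.hands.Hands(static_image_mode=True,
                                  max_num_hands=1) as hands:
        result = hands.process(cv2.cvtColor(hand_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    landmarks = [(p.x, p.y, p.z)
                 for p in result.multi_hand_landmarks[0].landmark]
    center = landmarks[9]  # middle-finger metacarpophalangeal joint
    handedness = result.multi_handedness[0].classification[0].label

    # Aggregate one flat feature row into a CSV file for the GRU stage;
    # the variable-length SIFT keypoints would be summarized separately.
    row = [c for xyz in landmarks for c in xyz] + list(center) + [handedness]
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow(row)
    return kp_xy, landmarks, center, handedness
```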
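
A minimal sketch of the GRU-based pose regressor and the reported metrics is given below, assuming 21 hand joints, a fixed sequence length, and a per-frame feature dimension read back from the CSV; the actual layer sizes, sequence length, and PCK threshold are assumptions not stated in the abstract.

```python
import numpy as np
import tensorflow as tf

# Illustrative dimensions: 21 joints, 16-frame windows, 70-D feature vectors
NUM_JOINTS, SEQ_LEN, FEAT_DIM = 21, 16, 70

# GRU regressor mapping a feature sequence to 21 x 3 joint coordinates
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, FEAT_DIM)),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(NUM_JOINTS * 3),
    tf.keras.layers.Reshape((NUM_JOINTS, 3)),
])
model.compile(optimizer="adam", loss="mse")

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: mean Euclidean distance over joints."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def pck(pred, gt, threshold=0.05):
    """Percentage of Correct Keypoints within a distance threshold."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1) < threshold) * 100.0
```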
