Abstract

Estimating an accurate 3D hand pose from a single RGB image is highly challenging because of geometric ambiguities, self-occlusions, and the absence of depth information. To address this, we propose a novel Five-Layer Ensemble CNN (5LENet) based on a hierarchical approach, which decomposes the hand pose estimation task into five single-finger pose estimation sub-tasks. The sub-task results are then fused to estimate the full 3D hand pose. The hierarchical design helps extract deeper and more representative finger features, which improves the accuracy of 3D hand pose estimation. In addition, following the topological structure of the hand, we build a hand model in which the center of the palm (denoted Palm) is connected to the middle finger, which further boosts 3D hand pose estimation performance. Extensive quantitative and qualitative results on two public datasets demonstrate the effectiveness of 5LENet, which achieves new state-of-the-art 3D estimation accuracy and outperforms most advanced estimation methods.

Highlights

  • Gestures are among the most commonly used forms of human expression, and accurate 3D hand pose estimation has already become a key technology in the fields of Human-Computer Interaction (HCI) and Virtual Reality (VR) [1,2,3,4,5]

  • Mainstream 3D hand pose estimation methods can be classified into two categories: holistic estimation method based on the hand [12,13,14,15,16,17,18,19,20] and hierarchical estimation method based on hand structure [21,22,23,24,25,26]

  • The holistic estimation method based on the hand aims to directly use a complete hand structure for estimation, which has developed into a mainstream method in recent years [12,14]

Summary

Introduction

Gestures are among the most commonly used forms of human expression, and accurate 3D hand pose estimation has already become a key technology in the fields of Human-Computer Interaction (HCI) and Virtual Reality (VR) [1,2,3,4,5]. Spurr et al. [14] trained encoder–decoder pairs from a generative perspective, which allows the full 3D hand pose to be estimated from different input modalities. These holistic methods fail to make good use of hand structure and lose a large amount of underlying structural information. Chen et al. [26] proposed a pose-guided structured region ensemble network (Pose-REN) to estimate 3D hand pose hierarchically and iteratively. Such hierarchical methods exploit the underlying information in hand topology to extract more representative hand features, and thereby estimate hand pose more accurately. By using the structural characteristics of the hand to extract deeper and more representative finger features, the proposed method improves 2D finger pose estimation, which in turn refines 3D finger pose estimation and raises the accuracy of 3D hand pose estimation. We conduct experiments on two public datasets, and the results demonstrate that our approach achieves a new state of the art in 3D hand pose prediction, which confirms the effectiveness of 5LENet.
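The decomposition described above can be illustrated with a minimal sketch. It assumes a 21-keypoint hand layout in which index 0 is Palm and each finger has four joints; this indexing, the names `FINGER_JOINTS`, `split_by_finger`, and `fuse_fingers`, and the plain index-scatter fusion are all hypothetical and stand in for whatever the network actually learns. The sketch only shows how a full hand pose might be split into five finger groups, with Palm attached to the middle finger, and then reassembled.

```python
# Hypothetical 21-keypoint layout (an assumption, not taken from the paper):
# index 0 is Palm, followed by four joints per finger from base to tip.
NUM_JOINTS = 21
FINGER_JOINTS = {
    "thumb": [1, 2, 3, 4],
    "index": [5, 6, 7, 8],
    "middle": [0, 9, 10, 11, 12],  # Palm is grouped with the middle finger
    "ring": [13, 14, 15, 16],
    "little": [17, 18, 19, 20],
}


def split_by_finger(hand_pose):
    """Split a full pose (21 (x, y, z) tuples) into five per-finger sub-poses."""
    return {
        name: [hand_pose[i] for i in indices]
        for name, indices in FINGER_JOINTS.items()
    }


def fuse_fingers(finger_poses):
    """Scatter per-finger joint estimates back into one 21-joint hand pose.

    This is a plain index scatter used for illustration; it is not the
    learned fusion of the original network.
    """
    full = [None] * NUM_JOINTS
    for name, indices in FINGER_JOINTS.items():
        for joint_index, coord in zip(indices, finger_poses[name]):
            full[joint_index] = coord
    if any(coord is None for coord in full):
        raise ValueError("some joints were not covered by any finger group")
    return full


# Round trip: splitting and then fusing recovers the original pose.
hand = [(float(i), 0.0, 0.0) for i in range(NUM_JOINTS)]
parts = split_by_finger(hand)
restored = fuse_fingers(parts)
```

In this layout every joint belongs to exactly one group, so each of the five sub-tasks can be estimated independently before the results are combined.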

Related Works
Overview
Localization and Segmentation Network
Hierarchical Ensemble Network
Hierarchical Estimation Network
Estimation Loss of 2D Pose
Loss of Hierarchical Estimation
Estimation Loss of 3D Pose
Total Loss of Network
OneHand10K
Evaluation Metrics
Experimental Details
Ablation Study
Effectiveness of Five-Layer Network
Effectiveness of Newly Added 3D Finger Pose Constraints
Effectiveness of Connecting Palm with a Single Finger
Effectiveness of Connecting Palm with Middle Finger
Comparison with the State-of-the-Art Methods
Qualitative Results
Conclusions