Abstract

Despite recent successes in hand pose estimation from RGB images or depth maps, inherent challenges remain. RGB-based methods suffer from heavy self-occlusions and depth ambiguity. Depth sensors are highly sensitive to distance and are generally limited to indoor use, which restricts the practical application of depth-based methods. These challenges have inspired us to combine the two modalities so that each offsets the shortcomings of the other. In this paper, we propose a novel RGB and depth information fusion network, called CrossFuNet, to improve the accuracy of 3D hand pose estimation. Specifically, the RGB image and the paired depth map are fed into two separate subnetworks. Their feature maps are then combined in a fusion module, for which we propose a completely new approach to merging information from the two modalities. Finally, the common heatmap-based method is used to regress the 3D keypoints. We validate our model on two public datasets, and the results show that it outperforms state-of-the-art methods.
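The abstract mentions that 3D keypoints are regressed via the common heatmap-based method. As a rough illustration of that general technique (not the paper's actual implementation; shapes and function names here are hypothetical), each joint is predicted as a 2D heatmap and the keypoint location is read off as the heatmap's peak:

```python
import numpy as np

def decode_keypoints(heatmaps):
    """Decode 2D keypoint locations from per-joint heatmaps via argmax.

    heatmaps: array of shape (num_joints, H, W) -- a hypothetical layout,
    not taken from the paper. Returns (num_joints, 2) coordinates in
    (x, y) order.
    """
    num_joints, h, w = heatmaps.shape
    flat = heatmaps.reshape(num_joints, -1)
    peak = flat.argmax(axis=1)                 # index of the hottest pixel per joint
    ys, xs = np.unravel_index(peak, (h, w))    # convert flat index back to (row, col)
    return np.stack([xs, ys], axis=1)

# Toy example: one joint whose heatmap peaks at (x=3, y=2).
hm = np.zeros((1, 5, 5))
hm[0, 2, 3] = 1.0
print(decode_keypoints(hm))  # [[3 2]]
```

In practice, methods of this kind typically refine the hard argmax (e.g. with a soft-argmax or sub-pixel offset) and lift the 2D detections to 3D, but the peak-picking step above captures the core idea.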

Highlights

  • We evaluated our approach on two public datasets: the Rendered Hand Dataset (RHD) [25] and the Stereo Hand Pose Tracking Benchmark (STB) [35].

  • We compared our approach with other methods, including weakly-supervised methods (Cai et al. [26] and Ge et al. [28]), a depth-guided method (Chen et al. [27]), and a directly fused method (Kazakos et al. [29]), on the RHD and STB datasets to illustrate the effectiveness of our proposed method, as shown in Table 1. The results of the compared methods (Cai et al. [26], Chen et al. [27], and Ge et al. [28]) on each dataset are taken from their original papers.

  • 3D hand pose regression based on a single RGB image or depth map is more challenging, since each modality alone loses some information.


Summary

Introduction

Hand pose estimation, which captures gestures from videos or images, is essential for human-computer interaction (HCI) applications. It has been studied in computer vision for decades [1,2], and the research focus has shifted from 2D to 3D hand pose estimation. Recent methods estimate 3D hand poses from a single RGB image or depth map [3,4,5,6,7,8,9]. Although these methods provide satisfactory results in some cases, with challenging single-frame input 3D hand pose estimation still suffers from the inherent depth ambiguity of the monocular setting, self-occlusions, background noise, and strong sensitivity to lighting.
