Abstract

Hand shape and pose recovery is essential for many computer vision applications, such as animation of a personalized hand mesh in a virtual environment. Although there are many hand pose estimation methods, only a few deep-learning-based algorithms target 3D hand shape and pose from a single RGB or depth image. Jointly estimating hand shape and pose is very challenging because none of the existing real benchmarks provides ground-truth hand shape. For this reason, we propose a novel weakly-supervised approach for 3D hand shape and pose recovery (named WHSP-Net) from a single depth image by learning shapes from unlabeled real data and labeled synthetic data. To this end, we propose a framework that consists of three novel components. The first is a Convolutional Neural Network (CNN) based deep network which produces 3D joint positions from learned 3D bone vectors using a new layer. The second is a novel shape decoder that recovers a dense 3D hand mesh from sparse joints. The third is a novel depth synthesizer which reconstructs a 2D depth image from the 3D hand mesh. The whole pipeline is fine-tuned in an end-to-end manner. We demonstrate that our approach recovers reasonable hand shapes from real-world datasets as well as from a live depth camera stream in real time. Our algorithm outperforms state-of-the-art methods that output more than the joint positions and shows competitive performance on the 3D pose estimation task.
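
The abstract describes a pipeline with three stages: a CNN that predicts 3D bone vectors and converts them into 3D joint positions through a dedicated layer, a shape decoder that lifts the sparse joints to a dense mesh, and a depth synthesizer that maps the mesh back to a 2D depth image for weak supervision. The sketch below (PyTorch) shows one plausible wiring of such a pipeline; the 21-joint skeleton, 1280-vertex mesh, kinematic parent table, and layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a bone-vectors -> joints -> mesh -> depth pipeline.
# All module names, dimensions, and the parent table are assumptions.
import torch
import torch.nn as nn

NUM_JOINTS = 21    # assumed hand skeleton size
NUM_VERTS = 1280   # assumed mesh resolution
# Assumed parent index per joint (entry 0 is the wrist/root).
PARENTS = [-1] + [0, 1, 2, 3,  0, 5, 6, 7,  0, 9, 10, 11,
                  0, 13, 14, 15,  0, 17, 18, 19]

class BonesToJoints(nn.Module):
    """Accumulate predicted 3D bone vectors along the kinematic chain to get
    absolute 3D joint positions (one reading of the abstract's 'new layer')."""
    def forward(self, bones):                      # bones: (B, NUM_JOINTS, 3)
        joints = [bones[:, 0]]                     # entry 0 treated as root position
        for j in range(1, NUM_JOINTS):
            joints.append(joints[PARENTS[j]] + bones[:, j])
        return torch.stack(joints, dim=1)          # (B, NUM_JOINTS, 3)

class ShapeDecoder(nn.Module):
    """Recover a dense 3D mesh from sparse joints (MLP stand-in)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_JOINTS * 3, 512), nn.ReLU(),
            nn.Linear(512, NUM_VERTS * 3))
    def forward(self, joints):                     # (B, NUM_JOINTS, 3)
        return self.net(joints.flatten(1)).view(-1, NUM_VERTS, 3)

class DepthSynthesizer(nn.Module):
    """Reconstruct a 2D depth image from the mesh; a toy 64x64 decoder
    standing in for whatever synthesizer the paper actually uses."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_VERTS * 3, 1024), nn.ReLU(),
            nn.Linear(1024, 64 * 64))
    def forward(self, verts):                      # (B, NUM_VERTS, 3)
        return self.net(verts.flatten(1)).view(-1, 1, 64, 64)

class WHSPNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # toy CNN backbone
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, NUM_JOINTS * 3))         # predicts bone vectors
        self.bones_to_joints = BonesToJoints()
        self.shape_decoder = ShapeDecoder()
        self.depth_synth = DepthSynthesizer()

    def forward(self, depth):                      # depth: (B, 1, H, W)
        bones = self.encoder(depth).view(-1, NUM_JOINTS, 3)
        joints = self.bones_to_joints(bones)
        verts = self.shape_decoder(joints)
        depth_rec = self.depth_synth(verts)        # enables weak supervision on real data
        return joints, verts, depth_rec

# Usage: joints, verts, depth_rec = WHSPNetSketch()(torch.randn(2, 1, 64, 64))
```

The reconstructed depth map is what makes weak supervision possible: on unlabeled real depth images, a reconstruction loss between the input and the synthesized depth can supervise shape without ground-truth meshes, while labeled synthetic data supervises the joints and mesh directly.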

Highlights

  • Estimating 3D hand shape and pose is very important for many computer vision (CV) applications such as animation of a personalized hand in virtual reality (VR) and augmented reality (AR), handling objects [1], and in-air signature [2]

  • Our algorithm outperforms state-of-the-art methods that output more than the joint positions and shows competitive performance on the 3D pose estimation task

Introduction

Estimating 3D hand shape and pose is important for many applications such as animation of a personalized hand in virtual reality (VR) and augmented reality (AR), handling objects [1], and in-air signature [2]. This task is very challenging due to various factors, including large variation in hand shapes, complex hand poses, many degrees of freedom, and occlusions, especially in egocentric viewpoints. Direct hand pose regression (discriminative) methods [3,4,5] show the highest accuracy on public benchmarks. On the other hand, structured hand pose estimation methods either implicitly incorporate hand structure [7,8,9] or embed a kinematic hand model in a deep network [10,11,12]. The kinematic model parameterization is highly nonlinear, which is difficult to optimize in deep networks [13].
