Abstract

In this article we tackle the problem of hand pose estimation when the hand is interacting with various objects, viewed from an egocentric viewpoint. This setting entails frequent occlusion of parts of the hand by the object as well as self-occlusions of the hand. We use a Voxel-to-Voxel approach to obtain hypotheses of the hand joint locations, ensemble these hypotheses, and apply several post-processing strategies to improve the results. We utilize prior models of hand pose in the form of Truncated Singular Value Decomposition (TruncatedSVD), together with temporal context, to produce refined hand joint locations. We present an ablation study showing the influence of the individual post-processing components. With our method we achieve state-of-the-art results on the HANDS19 Challenge Task 2 (Depth-Based 3D Hand Pose Estimation while Interacting with Objects), with an error of 33.09 mm on unseen test data.
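The pose-prior idea in the abstract — refining noisy joint hypotheses by projecting them onto a low-dimensional subspace learned with a truncated SVD — can be sketched as follows. This is a minimal illustration with synthetic poses, not the paper's implementation; the dimensions (21 joints, 10 components) and the synthetic training set are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_joints, k = 500, 21, 10
D = n_joints * 3  # flattened (x, y, z) per joint

# Hypothetical training poses lying near a low-dimensional subspace
# (in the paper these would come from annotated training data).
basis = rng.normal(size=(k, D))
train = rng.normal(size=(n_train, k)) @ basis \
        + 0.01 * rng.normal(size=(n_train, D))

# Fit a truncated SVD pose prior on mean-centred training poses.
mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
Vk = Vt[:k]  # top-k right singular vectors span the pose subspace

def refine(pose_flat):
    """Project a noisy pose estimate onto the learned pose subspace."""
    centred = pose_flat - mean
    return mean + (centred @ Vk.T) @ Vk

# A noisy hypothesis: a plausible pose corrupted by estimation error.
true_pose = (rng.normal(size=(1, k)) @ basis).ravel()
noisy = true_pose + 2.0 * rng.normal(size=D)
refined = refine(noisy)

err_before = np.linalg.norm(noisy - true_pose)
err_after = np.linalg.norm(refined - true_pose)
```

Projection discards the noise component orthogonal to the learned pose subspace, so `err_after` is typically much smaller than `err_before`; the residual error is the noise that happens to lie inside the subspace.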

Highlights

  • Devices capturing images or videos from first person view have recently become more common (e.g. Magic Leap One, Microsoft HoloLens, Google Glass)

  • We explore the effect of ensemble components and ensemble cardinality in a regression task at several levels to increase the estimation accuracy of the egocentric hand pose estimation task

  • To the best of our knowledge, we are the first ones to apply post-processing methods of a hand pose prior and temporal context modeled as an ensemble of TruncatedSVDs to the problem of hand pose estimation from egocentric viewpoint when interacting with objects
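The ensembling of joint-location hypotheses mentioned above can be illustrated with a simple sketch: several models each predict all joint positions, and the per-joint average across models reduces the estimation error. This is an assumed minimal example with synthetic predictions, not the paper's ensembling scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
n_models, n_joints = 5, 21

# Ground-truth joint positions in millimetres (synthetic).
true_joints = rng.uniform(-100.0, 100.0, size=(n_joints, 3))

# Hypothetical per-model hypotheses: truth plus independent noise
# (in the paper these would come from multiple Voxel-to-Voxel models).
hypotheses = true_joints + rng.normal(scale=5.0,
                                      size=(n_models, n_joints, 3))

# Ensemble by averaging the hypotheses per joint across models.
ensembled = hypotheses.mean(axis=0)

# Mean per-joint Euclidean error of one model vs. the ensemble.
err_single = np.linalg.norm(hypotheses[0] - true_joints, axis=1).mean()
err_ensemble = np.linalg.norm(ensembled - true_joints, axis=1).mean()
```

With independent noise, averaging n models shrinks the error roughly by a factor of sqrt(n), which is the basic motivation for varying ensemble cardinality in the experiments.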


Summary

INTRODUCTION

Devices capturing images or videos from a first-person (egocentric) view have recently become more common (e.g. Magic Leap One, Microsoft HoloLens, Google Glass). The second family of approaches, called bottom-up, is based on classification or regression of a given input hand image into a chosen representation of a hand model. These machine-learning approaches are mostly based on deep neural networks (mainly convolutional) and in principle need a large amount of training data to achieve satisfactory precision [21]–[28]. These approaches can be further divided according to their use of temporal context: methods that detect or track the hand from a single image, and methods that exploit temporal information across consecutive frames.

RELATED WORK
METHODS
HAND JOINTS LOCATION ESTIMATION
EXPERIMENTS
Findings
CONCLUSION