In this paper, we tackle the problem of detailed hand-object 3D reconstruction from monocular video with unknown objects, for applications where accuracy and level of detail are important, e.g. object hand-over in human–robot collaboration, or manipulation and contact point analysis. While the recent literature on this topic is promising, the accuracy and generalization abilities of existing methods are still lacking. This is due to several limitations, such as the assumption of a known object class or model for a small number of instances, or over-reliance on off-the-shelf keypoint and structure-from-motion methods for object-relative viewpoint estimation, which are prone to complete failure on previously unobserved, poorly textured objects or under hand-object occlusions. To address these shortcomings, we present a two-stage pipeline that surpasses state-of-the-art (SotA) performance on several metrics. First, we robustly retrieve object-relative viewpoints with a learned pairwise camera pose estimator that can be trained in a low-data regime, followed by global Shonan pose averaging. Second, we simultaneously estimate detailed 3D hand-object shapes and refine the camera poses using a differentiable renderer-based optimizer. To better assess the out-of-distribution abilities of existing methods, and to showcase our methodological contributions, we introduce the new SHOWMe benchmark dataset with 96 sequences annotated with poses, millimetric textured 3D shape scans, and parametric hand models, offering new object and hand diversity. Remarkably, we show that our method reconstructs 100% of these sequences, whereas SotA Structure-from-Motion (SfM) or hand-keypoint-based pipelines do not, and that it obtains reconstructions of equivalent or better precision when existing methods do succeed in providing a result. We hope these contributions lead to further research under harder input assumptions. The dataset can be downloaded at https://download.europe.naverlabs.com/showme.
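
To illustrate the pose averaging step of the first stage, the following is a minimal sketch using GTSAM's Python bindings, assuming the pairwise relative pose estimates have already been exported to a g2o pose-graph file (the file name pairwise_poses.g2o is hypothetical); it demonstrates generic Shonan rotation averaging only, not the authors' learned pairwise estimator or the full pipeline.

    # Minimal sketch: recover globally consistent absolute rotations from noisy
    # pairwise relative poses via Shonan averaging (GTSAM Python bindings).
    # Assumption: the pairwise estimates were written to a g2o pose-graph file.
    import gtsam

    G2O_FILE = "pairwise_poses.g2o"  # hypothetical output of a pairwise pose estimator

    shonan = gtsam.ShonanAveraging3(G2O_FILE)       # read SO(3) constraints from the pose graph
    initial = shonan.initializeRandomly()           # random initialization of the relaxation
    rotations, final_cost = shonan.run(initial, 3, 10)  # solve relaxations from p=3 up to p=10

    # Each key now holds a globally consistent absolute rotation (gtsam.Rot3),
    # assuming the g2o file indexes frames 0..N-1.
    for key in range(shonan.nrUnknowns()):
        print(key, rotations.atRot3(key).matrix())

Translations can then be recovered linearly given the averaged rotations, before the second-stage refinement with a differentiable renderer.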