The recovery of 3D interacting hand meshes in the wild (ITW) is crucial for 3D full-body mesh reconstruction, especially when 3D annotations are limited. Recent ITW interacting-hand recovery methods bring the two hands into a shared 2D scale space and learn effectively from ITW datasets. However, they do not deeply exploit the intrinsic interaction dynamics between the hands. In this work, we propose TransWild, a novel framework for 3D interacting hand mesh recovery that leverages a weight-shared Intersection-over-Union (IoU) guided Transformer for feature interaction. Building on the harmonization of ITW and MoCap datasets within a unified 2D scale space, our hand feature interaction mechanism, powered by an IoU-guided Transformer, enables more accurate estimation of interacting hands. This design stems from the observation that hand detection yields the IoU of the two hands' bounding boxes, a valuable cue to the degree of interaction; guiding the Transformer with this IoU therefore enriches its ability to decode and integrate interaction cues into the recovery process. To ensure consistent training outcomes, we develop a strategy of training with augmented ground-truth bounding boxes to address the inherent variability of detected boxes. Quantitative evaluations on two prominent benchmarks for 3D interacting hands underscore our method's superior performance. The code will be released after acceptance.
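To make the IoU-guidance idea concrete, the following is a minimal sketch (not the paper's implementation) of how the IoU of the two hands' bounding boxes could condition a weight-shared cross-attention layer between the two hands' feature tokens. All module names, the IoU-token design, and the dimensions are hypothetical assumptions for illustration.

```python
import torch
import torch.nn as nn

def bbox_iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2); returns one IoU value per batch element.
    x1 = torch.maximum(box_a[:, 0], box_b[:, 0])
    y1 = torch.maximum(box_a[:, 1], box_b[:, 1])
    x2 = torch.minimum(box_a[:, 2], box_b[:, 2])
    y2 = torch.minimum(box_a[:, 3], box_b[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box_a[:, 2] - box_a[:, 0]) * (box_a[:, 3] - box_a[:, 1])
    area_b = (box_b[:, 2] - box_b[:, 0]) * (box_b[:, 3] - box_b[:, 1])
    return inter / (area_a + area_b - inter + 1e-6)

class IoUGuidedInteraction(nn.Module):
    """Illustrative weight-shared cross-attention between the two hands,
    conditioned on the IoU of their bounding boxes (hypothetical design)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Embed the scalar IoU into an extra key/value token.
        self.iou_embed = nn.Sequential(
            nn.Linear(1, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats_q, feats_kv, iou):
        # iou: (B,) overlap of the two hand boxes, used as a conditioning token.
        iou_tok = self.iou_embed(iou.unsqueeze(-1)).unsqueeze(1)  # (B, 1, D)
        kv = torch.cat([iou_tok, feats_kv], dim=1)  # prepend IoU token
        out, _ = self.attn(feats_q, kv, kv)
        return self.norm(feats_q + out)

# One shared module processes both hands (weight sharing):
layer = IoUGuidedInteraction()
left, right = torch.randn(2, 49, 256), torch.randn(2, 49, 256)
boxes_l = torch.tensor([[0.1, 0.1, 0.6, 0.6], [0.2, 0.2, 0.7, 0.7]])
boxes_r = torch.tensor([[0.4, 0.4, 0.9, 0.9], [0.5, 0.1, 0.9, 0.6]])
iou = bbox_iou(boxes_l, boxes_r)
left_out = layer(left, right, iou)   # left attends to right, guided by IoU
right_out = layer(right, left, iou)  # same weights, roles swapped
```

Prepending the IoU as an attention token is one plausible way to inject the overlap cue; gating or bias-based conditioning would be equally valid sketches under the same assumption.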