Abstract

In this article, we propose a novel compressed latent distribution representation for 3D hand pose estimation from monocular RGB images to alleviate the channel correspondence problem. This problem occurs when the 2D and depth coordinates are estimated from independent feature maps, so that the 2D and depth channel sequences may not match during cross-dataset inference. In our compressed latent distribution representation, by contrast, the 2D and depth feature maps for each joint are interconnected and inter-constrained more directly, which alleviates the channel correspondence problem and improves cross-dataset performance. Moreover, we design an efficient encoder-decoder network that maintains the resolution of the feature maps to enable better hand feature extraction from monocular RGB images. The overall pipeline contains two branches: a 2D hand pose estimation branch based on a latent heatmap representation (LHR), and a 3D hand pose estimation branch based on our proposed latent distribution representation (LDR). The 2D estimation branch serves as guidance for the 3D branch, which simplifies the optimization of the overall network and leads to faster convergence during training. Results on several benchmark datasets (STB, RHD, and the more recently released InterHand2.6M) demonstrate that the proposed method achieves state-of-the-art (SOTA) performance.
