Abstract
Estimating 3D hand pose from a single RGB image is a challenging task because of its ill-posed nature (i.e., depth ambiguity). Recently, various generative approaches have been proposed to predict the 3D joints of an RGB hand image by learning a unified latent space between two modalities (i.e., RGB image and 3D joints). However, projecting multi-modal data (i.e., RGB images and 3D joints) into a unified latent space is difficult, as the modality-specific features usually interfere with learning the optimal latent space. Hence, in this paper, we propose to disentangle the latent space into two sub-latent spaces, a modality-specific latent space and a pose-specific latent space, for 3D hand pose estimation. Our proposed method, Disentangled Cross-Modal Latent Space (DCMLS), consists of two variational autoencoder networks and auxiliary components that connect the two VAEs to align the underlying hand poses and transfer modality-specific context from RGB to 3D. We align the hand pose latent space across the two modalities using a cross-modal discriminator trained with an adversarial learning strategy. For the context latent space, we learn a context translator to provide access to cross-modal context. Experimental results on two widely used public benchmark datasets, RHD and STB, demonstrate that the proposed DCMLS method clearly outperforms state-of-the-art methods on single-image 3D hand pose estimation.
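The following is a minimal sketch, not the authors' implementation, of the architecture the abstract describes: each modality gets a VAE whose encoder splits into a pose-specific and a modality-specific (context) latent, a cross-modal discriminator aligns the pose latents adversarially, and a context translator maps RGB context to 3D-joint context. All layer sizes, feature dimensions, and module names below are illustrative assumptions.

```python
# Illustrative sketch of the DCMLS idea; dimensions and names are assumptions.
import torch
import torch.nn as nn

POSE_DIM, CTX_DIM, NUM_JOINTS = 64, 64, 21

class DisentangledVAE(nn.Module):
    """Encodes one modality into a pose-specific latent and a modality-specific (context) latent."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_pose_mu = nn.Linear(256, POSE_DIM)
        self.to_pose_logvar = nn.Linear(256, POSE_DIM)
        self.to_ctx_mu = nn.Linear(256, CTX_DIM)
        self.to_ctx_logvar = nn.Linear(256, CTX_DIM)
        # The decoder reconstructs the modality from both sub-latents.
        self.decoder = nn.Sequential(
            nn.Linear(POSE_DIM + CTX_DIM, 256), nn.ReLU(), nn.Linear(256, out_dim)
        )

    @staticmethod
    def reparameterize(mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, x):
        h = self.encoder(x)
        z_pose = self.reparameterize(self.to_pose_mu(h), self.to_pose_logvar(h))
        z_ctx = self.reparameterize(self.to_ctx_mu(h), self.to_ctx_logvar(h))
        recon = self.decoder(torch.cat([z_pose, z_ctx], dim=-1))
        return recon, z_pose, z_ctx

# Cross-modal discriminator: tries to tell which modality a pose latent came from,
# so adversarial training pushes the two pose-latent distributions to align.
pose_discriminator = nn.Sequential(nn.Linear(POSE_DIM, 128), nn.ReLU(), nn.Linear(128, 1))

# Context translator: maps RGB-specific context to 3D-joint-specific context,
# giving the 3D branch access to cross-modal context.
context_translator = nn.Sequential(nn.Linear(CTX_DIM, 128), nn.ReLU(), nn.Linear(128, CTX_DIM))

rgb_vae = DisentangledVAE(in_dim=512, out_dim=512)                            # RGB image features
joint_vae = DisentangledVAE(in_dim=NUM_JOINTS * 3, out_dim=NUM_JOINTS * 3)    # 3D joint coordinates

# One illustrative inference pass: predict 3D joints from an RGB feature vector.
rgb_feat = torch.randn(8, 512)
_, z_pose_rgb, z_ctx_rgb = rgb_vae(rgb_feat)
z_ctx_3d = context_translator(z_ctx_rgb)
pred_joints = joint_vae.decoder(torch.cat([z_pose_rgb, z_ctx_3d], dim=-1)).view(8, NUM_JOINTS, 3)
```

In this sketch the reconstruction and KL losses of the two VAEs, the adversarial loss on `pose_discriminator`, and a translation loss on `context_translator` would be trained jointly; the exact loss weighting and network backbones are not specified by the abstract.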