Regression-based deep neural networks have achieved state-of-the-art performance on depth-based 3D hand pose estimation. This paper focuses on improving the regression mapping between features and pose joints. Inspired by the distribution-modeling ability of Variational Autoencoders, we introduce an auxiliary variable into the regression network. During training, the auxiliary variable is modeled by an inference distribution that learns the underlying structural kinematics of the human hand. Unlike other regression methods for hand pose, our network estimates the pose joints from both the input depth features and the learned auxiliary variable. We show that introducing the auxiliary variable benefits the regression through 1) regularization modeled by the inference distribution and 2) prior information carried by the auxiliary model. The effectiveness of the proposed regression method is evaluated through extensive self-comparative experiments and in comparison with other regression methods on hand pose datasets. The proposed network is easy to train in an end-to-end manner and can work with various feature extraction methods. Applying the proposed regression method to an existing hand pose estimation system improves the estimation accuracy by 18.35% and 16.65% on public hand pose datasets.
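As a rough illustration of the idea of regressing from features together with an auxiliary variable, the sketch below shows one VAE-style way to write the training loss. It is not the network described in this paper: the Gaussian inference distribution, the standard-normal prior, the single linear regressor, the weight `beta`, and all function names are assumptions made for illustration. In a real model `mu` and `logvar` would be produced by an inference network; here they are simply passed in.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row of coefficients per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def training_loss(features, target, mu, logvar, weights, bias, rng, beta=1.0):
    """Squared regression error plus a KL regularizer on the auxiliary variable.

    Assumption: the pose is regressed from the concatenation [features, z].
    """
    z = reparameterize(mu, logvar, rng)
    pred = linear(features + z, weights, bias)  # list concatenation: [features, z]
    sq_err = sum((p - t) ** 2 for p, t in zip(pred, target))
    return sq_err + beta * kl_to_standard_normal(mu, logvar)
```

In this sketch the KL term plays the role of the regularizer, while the sampled auxiliary variable gives the regressor extra input alongside the depth features.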