Estimating accurate 3D human poses from a monocular video is fundamental to various computer vision tasks. Existing methods exploit 2D-to-3D pose lifting, multiview images, and depth sensors to model spatio-temporal dependencies. However, depth ambiguities, occlusions, and large temporal receptive fields remain challenging for these approaches. To address these challenges, we propose a novel prior-free, DCNN-based 3D human pose estimation method for monocular image sequences that represents the pose with limb vectors. Our method comprises two subnetworks: a limb direction estimator and a limb length estimator. The limb direction estimator uses a fully convolutional network to model limb direction vectors across a temporal window. We show that dilated convolutions combined with a relatively small receptive field significantly reduce network complexity while maintaining estimation accuracy. The limb length estimator, in turn, derives stable limb length estimates from a reliable frame set. Our model outperforms existing methods on the Human3.6M and MPI-INF-3DHP datasets.
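To make the limb-vector representation concrete, the sketch below shows one way a 3D pose could be recovered once a direction and a length are available for every limb: starting from a root joint, each child joint is placed at its parent's position plus the limb length times the unit limb direction. This is an illustration under stated assumptions, not the authors' implementation. The five-joint chain in `LIMBS`, the function names, and the parent-first ordering are all hypothetical, and the two subnetworks that would produce the directions and lengths are not modeled here.

```python
import math

# Hypothetical toy skeleton: (parent, child) joint indices, listed parent-first.
# A real skeleton (e.g. the joints used for Human3.6M) would have more limbs.
LIMBS = [(0, 1), (1, 2), (2, 3), (3, 4)]


def normalize(vec):
    """Scale a 3D vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vec))
    if norm == 0.0:
        raise ValueError("limb direction must be non-zero")
    return tuple(c / norm for c in vec)


def compose_pose(root, directions, lengths, limbs=LIMBS, root_index=0):
    """Place joints along a kinematic tree.

    root       -- 3D position of the root joint
    directions -- one (not necessarily unit) direction vector per limb
    lengths    -- one scalar length per limb
    Returns a list of joint positions ordered by joint index.
    """
    joints = {root_index: tuple(float(c) for c in root)}
    for (parent, child), direction, length in zip(limbs, directions, lengths):
        unit = normalize(direction)
        px, py, pz = joints[parent]  # parent must already be placed
        joints[child] = (
            px + length * unit[0],
            py + length * unit[1],
            pz + length * unit[2],
        )
    return [joints[i] for i in sorted(joints)]


if __name__ == "__main__":
    pose = compose_pose(
        root=(0.0, 0.0, 0.0),
        directions=[(0, 0, 2), (0, 3, 0), (1, 0, 0), (0, 0, -1)],
        lengths=[0.5, 0.25, 0.3, 0.4],
    )
    for index, joint in enumerate(pose):
        print(index, joint)
```

Separating direction from length in this way means the per-frame directions can change freely while the lengths, which are constant for a given subject, can be estimated once and reused across the sequence.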