Abstract

Although state-of-the-art video-based 3D human pose and shape estimation methods have made great progress in improving the temporal consistency of their predictions, a gap remains in 3D pose accuracy on in-the-wild datasets compared with the best single-image-based methods. There are two reasons for this. First, single-image-based methods can benefit from in-the-wild single-image datasets with high-precision 3D pseudo-annotations; second, the iterative error feedback (IEF) loop, a parameter-regressor structure, has inherent limitations. In this paper, we propose a new neural-network framework for 3D human pose and shape estimation from video. We first design two recurrent encoders based on the Convolutional Gated Recurrent Unit (ConvGRU) to extract volumetric features with temporal information; a part attention mechanism then produces the final features, from which a regressor predicts the Skinned Multi-Person Linear (SMPL) model parameters. We also design our method to support hybrid training on video datasets and single-image datasets, so that it can benefit from in-the-wild single-image datasets. Experimental results show that our method outperforms state-of-the-art video-based methods on the in-the-wild 3DPW dataset without any fine-tuning.
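The pipeline outlined in the abstract (ConvGRU temporal encoding, part attention pooling, then SMPL parameter regression) can be sketched as follows. This is a minimal illustrative numpy sketch, not the paper's implementation: all tensor dimensions, weight shapes, the part count, and the use of 1×1 gate convolutions (per-pixel channel mixing instead of larger spatial kernels) are assumptions made for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ConvGRUCell:
    """Minimal ConvGRU cell. Gates use 1x1 convolutions (channel mixing
    via einsum) for brevity; a real ConvGRU would use spatial kernels."""
    def __init__(self, in_ch, hid_ch, rng):
        k = in_ch + hid_ch  # gates see the concatenated [input, hidden]
        self.Wz = rng.standard_normal((hid_ch, k)) * 0.1  # update gate
        self.Wr = rng.standard_normal((hid_ch, k)) * 0.1  # reset gate
        self.Wh = rng.standard_normal((hid_ch, k)) * 0.1  # candidate state

    def step(self, x, h):
        # x: [C_in, H, W] frame features; h: [C_hid, H, W] hidden state
        xh = np.concatenate([x, h], axis=0)
        z = sigmoid(np.einsum('oc,chw->ohw', self.Wz, xh))
        r = sigmoid(np.einsum('oc,chw->ohw', self.Wr, xh))
        xrh = np.concatenate([x, r * h], axis=0)
        h_tilde = np.tanh(np.einsum('oc,chw->ohw', self.Wh, xrh))
        return (1 - z) * h + z * h_tilde  # GRU convex-combination update

def part_attention_pool(feat, Wa):
    """Pool features with one spatial-softmax attention map per body part.
    feat: [C, H, W]; Wa: [P, C] (hypothetical 1x1 attention conv)."""
    logits = np.einsum('pc,chw->phw', Wa, feat)
    flat = logits.reshape(logits.shape[0], -1)
    flat -= flat.max(axis=1, keepdims=True)          # numerical stability
    attn = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    attn = attn.reshape(logits.shape)                # softmax over H*W
    return np.einsum('phw,chw->pc', attn, feat)      # [P, C] part features

# --- toy forward pass over a short clip (all sizes are illustrative) ---
rng = np.random.default_rng(0)
T, C_in, C_hid, H, W, P = 4, 8, 16, 7, 7, 24
cell = ConvGRUCell(C_in, C_hid, rng)
h = np.zeros((C_hid, H, W))
for _ in range(T):                       # recurrent temporal encoding
    frame_feat = rng.standard_normal((C_in, H, W))
    h = cell.step(frame_feat, h)

Wa = rng.standard_normal((P, C_hid)) * 0.1
part_feat = part_attention_pool(h, Wa)   # [24, 16]

# Regressor head: 72 pose + 10 shape + 3 camera = 85 SMPL-style parameters
Wreg = rng.standard_normal((85, P * C_hid)) * 0.01
theta = Wreg @ part_feat.reshape(-1)
```

In a full model the regressor would typically be an MLP (possibly with an IEF loop, whose limitations the abstract discusses), and the two ConvGRU encoders would run over CNN backbone features rather than random inputs.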

