Abstract

This work addresses the challenging problem of estimating full 3D human shape and pose from monocular video. Because real-world datasets with 3D mesh labels are limited, most current 3D human shape reconstruction methods operate only on single RGB images, discarding all temporal information. In contrast, we propose temporally refined Graph U-Nets, comprising an image-level module and a video-level module, to solve this problem. The image-level module is a Graph U-Net for human shape and pose estimation from images, in which the Graph Convolutional Neural Network (Graph CNN) enables information exchange between neighboring vertices, while the U-Net architecture enlarges the receptive field of each vertex and fuses high-level and low-level features. The video-level module is a small Residual Temporal Graph CNN (Residual TG-CNN) that learns temporal dynamics from both structural and temporal neighbors. Because the dynamics of each vertex are continuous in time and strongly correlated with those of its structural neighbors, fusing temporal dynamics helps resolve the body ambiguity inherent in single images. Our algorithm makes full use of labels from image-level datasets and refines the image-level results through the video-level module. Evaluated on the Human3.6M and 3DPW datasets, our model produces accurate 3D human meshes and achieves superior 3D human pose estimation accuracy compared with state-of-the-art methods.
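The core operation the abstract refers to, a Graph CNN layer that lets each mesh vertex aggregate features from its structural neighbors, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the degree-normalized propagation rule, the toy 4-vertex chain graph, and the feature dimensions are all assumptions chosen for brevity.

```python
import numpy as np

def graph_conv(X, A, W):
    """One graph-convolution step: each vertex aggregates features
    from itself and its structural neighbors (rows of A), then
    applies a shared linear map W and a ReLU."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))     # row-normalize by degree
    return np.maximum(D_inv @ A_hat @ X @ W, 0)  # ReLU activation

# Toy mesh graph: 4 vertices in a chain, 3 input features, 2 output features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))   # per-vertex input features
W = rng.standard_normal((3, 2))   # shared layer weights
H = graph_conv(X, A, W)
print(H.shape)  # (4, 2): one 2-dim feature per vertex
```

Stacking such layers in a U-Net pattern (pooling coarsens the mesh graph, unpooling restores it) is what enlarges each vertex's receptive field in the image-level module.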
