Abstract

The task of reconstructing detailed 3D human body models from images is an interesting but challenging problem in computer vision due to the high degrees of freedom of human bodies. This work proposes a coarse-to-fine method to reconstruct detailed 3D human bodies from multi-view images, combining Voxel Super-Resolution (VSR) with learned implicit representations. First, coarse 3D models are estimated by learning a Pixel-aligned Implicit Function based on Multi-scale Features (MF-PIFu), which are extracted from the multi-view images by multi-stage hourglass networks. Then, taking the low-resolution voxel grids generated from the coarse 3D models as input, VSR is implemented by learning an implicit function through a multi-stage 3D convolutional neural network. Finally, VSR produces refined, detailed 3D human body models that preserve details and reduce the false reconstructions of the coarse 3D models. Benefiting from the implicit representation, the training process of our method is memory-efficient, and the detailed 3D human bodies produced from multi-view images are continuous decision boundaries with high-resolution geometry. In addition, the coarse-to-fine method based on MF-PIFu and VSR simultaneously removes false reconstructions and preserves appearance details in the final reconstruction. In experiments, our method quantitatively and qualitatively achieves competitive 3D human body models from images with various poses and shapes on both real and synthetic datasets.
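As a reading aid, here is a minimal sketch of how a pixel-aligned implicit occupancy query over multi-view images could be written in PyTorch. The two-stage convolutional encoder stands in for the multi-stage hourglass networks, the projection is a toy orthographic model, and the layer sizes, names (MultiScaleEncoder, OccupancyMLP, query_occupancy), and view fusion by averaging are illustrative assumptions, not the authors' exact architecture.

```python
# Sketch of a pixel-aligned implicit occupancy query (MF-PIFu-style).
# All modules and sizes are illustrative assumptions, not the paper's network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEncoder(nn.Module):
    """Stand-in for the multi-stage hourglass networks: returns feature maps
    at two scales so each query point gets both local and global context."""
    def __init__(self, ch=(32, 64)):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, ch[0], 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(ch[0], ch[1], 3, 2, 1), nn.ReLU())

    def forward(self, imgs):                        # imgs: (V, 3, H, W)
        f1 = self.stage1(imgs)                      # finer-scale features
        f2 = self.stage2(f1)                        # coarser-scale features
        return [f1, f2]

class OccupancyMLP(nn.Module):
    """Maps a pixel-aligned multi-scale feature plus point depth to occupancy in [0, 1]."""
    def __init__(self, feat_dim=96):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, x):
        return self.mlp(x)

def orthographic_project(points, calib):
    """Toy orthographic projection: calib is a (3, 4) view matrix that maps points
    to pixel coordinates normalized to [-1, 1]; the third row gives the depth."""
    homog = torch.cat([points, torch.ones(points.shape[0], 1)], dim=1)  # (N, 4)
    xyz = homog @ calib.t()                                             # (N, 3)
    return xyz[:, :2], xyz[:, 2:3]

def query_occupancy(points, imgs, calibs, encoder, decoder):
    """points: (N, 3) 3D samples; imgs: (V, 3, H, W) calibrated views; calibs: (V, 3, 4)."""
    feats = encoder(imgs)
    per_view = []
    for v in range(imgs.shape[0]):
        uv, z = orthographic_project(points, calibs[v])
        grid = uv.view(1, -1, 1, 2)                  # grid_sample expects (1, N, 1, 2)
        pixel_feats = [
            F.grid_sample(f[v:v + 1], grid, align_corners=True)   # (1, C, N, 1)
             .squeeze(-1).squeeze(0).t()                           # -> (N, C)
            for f in feats]                          # sample every feature scale
        per_view.append(decoder(torch.cat(pixel_feats + [z], dim=1)))
    return torch.stack(per_view).mean(dim=0)         # fuse the views by averaging
```

The voxel super-resolution stage can be sketched in the same spirit: a small 3D CNN (standing in for the multi-stage 3D convolutional network) encodes the coarse low-resolution voxel grid into a feature volume, and an MLP decodes trilinearly sampled features at arbitrary query points into refined occupancy, giving a continuous, higher-resolution surface. Again, all layer sizes and names are assumptions for illustration.

```python
# Sketch of implicit voxel super-resolution conditioned on the coarse voxel grid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelEncoder3D(nn.Module):
    """Stand-in 3D CNN producing a feature volume aligned with the coarse grid."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU())

    def forward(self, voxels):          # voxels: (1, 1, D, H, W) coarse occupancy
        return self.net(voxels)

class RefineDecoder(nn.Module):
    """Maps a trilinearly sampled feature plus the query coordinate to refined occupancy."""
    def __init__(self, ch=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(ch + 3, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, x):
        return self.mlp(x)

def refine_occupancy(points, coarse_voxels, encoder, decoder):
    """points: (N, 3) query coordinates normalized to [-1, 1]^3;
    coarse_voxels: (1, 1, D, H, W) occupancy grid from the coarse 3D model."""
    vol = encoder(coarse_voxels)                          # (1, C, D, H, W)
    grid = points.view(1, -1, 1, 1, 3)                    # 3D grid_sample layout
    feat = F.grid_sample(vol, grid, align_corners=True)   # (1, C, N, 1, 1)
    feat = feat.reshape(vol.shape[1], -1).t()             # -> (N, C)
    return decoder(torch.cat([feat, points], dim=1))      # refined occupancy in [0, 1]
```

Because occupancy is queried at arbitrary continuous points, a high-resolution mesh can be extracted afterwards (e.g., with Marching Cubes) without storing a dense voxel grid during training, which is what makes the implicit representation memory-efficient.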

Highlights

  • Recovering detailed 3D human body models from images attracts much attention because of its wide applications in the movie industry, animation, and Virtual/Augmented Reality

  • In this paper, we propose a coarse-to-fine method for detailed 3D human body reconstruction from multi-view images through learning an implicit representation

  • The coarse 3D models are estimated from multi-view images through learning a pixel-aligned implicit function based on multi-scale features, which encode both local and global information

Introduction

Recovering detailed 3D human body models from images attracts much attention because of its wide applications in the movie industry, animation, and Virtual/Augmented Reality. With the development of deep learning in 3D vision, estimating 3D human bodies from common 2D images has become an active research topic and achieved notable progress, because 2D images are much easier to obtain. Reconstructing 3D human bodies from RGB images mainly depends on pre-defined parametric human body models. The main idea of this route is to fit a parametric human body model to prior information such as the body skeleton, 2D joint points, and silhouettes [2, 6, 8]. The 3D human body models estimated by these methods cannot satisfy the realism requirements of many applications because the parametric models often do not encode detailed appearance.
