In this paper, we address the problem of capturing both the shape and the pose of a human character using a single depth sensor. Some previous works proposed fitting a generic parametric human template to the depth image, while others developed deep learning (DL) approaches to find the correspondence between depth pixels and vertices of the template. We designed a hybrid approach that combines the advantages of both methods, and conducted extensive experiments on the SURREAL (Varol et al., 2017) and DFAUST (Bogo et al., 2017) datasets and on a subset of AMASS (Mahmood et al., 2019). Results show that this hybrid approach improves pose and shape estimation compared with using DL or model fitting alone. We also evaluated the ability of the DL-based dense correspondence method to segment the background in addition to the body parts. In addition, we evaluated four methods for performing model fitting from a dense correspondence when the number of available 3D points differs from the number of corresponding template vertices. These two evaluations helped us better understand how to combine DL and model fitting, and the potential limits of this approach when dealing with real depth images. Future work could explore taking temporal information into account, which has been shown to increase the accuracy of pose and shape reconstruction based on a single depth or RGB image.