Deep Learning Methods for 3D Human Pose Estimation under Different Supervision Paradigms: A Survey

Dejun Zhang, Mingyue Guo, Yilin Chen, Yiqi Wu

doi:10.3390/electronics10182267

Abstract

The rise of deep learning technology has broadly promoted the practical application of artificial intelligence in production and daily life. In computer vision, many human-centered applications, such as video surveillance, human-computer interaction, and digital entertainment, rely heavily on accurate and efficient human pose estimation techniques. Inspired by the remarkable achievements in learning-based 2D human pose estimation, numerous research studies are devoted to 3D human pose estimation via deep learning methods. Against this backdrop, this paper provides an extensive survey of recent literature on deep learning methods for 3D human pose estimation in order to present the development of this research, track the latest trends, and analyze the characteristics of the different types of methods. The literature is reviewed along the general pipeline of 3D human pose estimation, which consists of human body modeling, learning-based pose estimation, and regularization for refinement. Unlike existing reviews of the same topic, this paper focuses on deep learning-based methods. Learning-based pose estimation is discussed in two categories: single-person and multi-person. Each category is further divided by data type into image-based and video-based methods. Moreover, because data is so significant for learning-based methods, this paper surveys 3D human pose estimation methods according to a taxonomy of supervision forms. Finally, this paper lists the current, widely used datasets and compares the performance of the reviewed methods. Based on this literature survey, it can be concluded that each branch of 3D human pose estimation starts with fully-supervised methods, and that there is still much room for multi-person pose estimation based on other forms of supervision from both images and videos. Despite the significant development of 3D human pose estimation via deep learning, the inherent ambiguity and occlusion problems remain challenging issues that need to be better addressed.

Highlights

Human pose estimation is an important and widely studied research topic in computer vision.
The main challenges in pose estimation based on deep learning methods remain to be solved. (i) Lack of 3D training data: Since manual 3D annotation is expensive and time-consuming, little training data paired with 3D annotations is available.
We briefly describe below the models that are employed in some of the approaches studied in this paper.

Summary

Introduction

Human pose estimation is an important and widely studied research topic in computer vision. Given an image or video input, 3D human pose estimation aims to predict the configuration of the human body. The main challenges in pose estimation based on deep learning methods remain to be solved. (i) Lack of 3D training data: Since manual 3D annotation is expensive and time-consuming, little training data paired with 3D annotations is available. (ii) Depth ambiguity: Depth ambiguity makes estimating 3D pose an ill-posed problem. It occurs because joints of different poses, although located at different 3D depths, may be projected to the same 2D location. Some works investigate this topic in different ways (e.g., References [16,17,18]). (iii) Occlusions: For single-person 3D pose estimation, self-occlusions (e.g., body parts occluding one another) can greatly affect the accuracy of predicted 3D joint locations. Many works aim to overcome this obstacle (e.g., References [19,20]).
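The depth ambiguity described in (ii) can be illustrated with a minimal sketch. It assumes an ideal pinhole camera with no lens distortion and joint positions given in camera coordinates; the function names, the focal length, and the example wrist position are hypothetical choices for illustration and are not taken from the surveyed methods.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_pinhole(point: Point3D, focal_length: float = 1.0) -> Point2D:
    """Project a camera-space 3D point onto the image plane (ideal pinhole model)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    # Perspective division: the depth z is lost after this step.
    return (focal_length * x / z, focal_length * y / z)


def scale_along_ray(point: Point3D, factor: float) -> Point3D:
    """Move a point along the ray through the camera center by scaling it."""
    x, y, z = point
    return (factor * x, factor * y, factor * z)


# Hypothetical wrist joint in camera coordinates (meters); assumed values.
wrist_near = (0.5, -0.25, 2.0)
# The same ray, but twice as far from the camera.
wrist_far = scale_along_ray(wrist_near, 2.0)

uv_near = project_pinhole(wrist_near)
uv_far = project_pinhole(wrist_far)

print("near joint:", wrist_near, "->", uv_near)
print("far joint: ", wrist_far, "->", uv_far)
print("same 2D location:", uv_near == uv_far)
```

Both joints land on the same image coordinates even though their depths differ by a factor of two, so a single 2D observation cannot determine which 3D position produced it. This is why methods typically need additional cues, such as learned pose priors, temporal context, or multiple views, to resolve the depth.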

Methods

Results

Conclusion