Abstract

3D human pose estimation from a monocular RGB image is a challenging task in computer vision because of the depth ambiguity of a single RGB image. As most methods consider joint locations independently, which can lead to overfitting on specific datasets, it is crucial to consider the plausibility of 3D poses in terms of their overall structure. In this paper, we present Generative Adversarial Networks (GANs) for 3D human pose estimation, which learn plausible 3D human body representations through adversarial training. In our GANs, the generator regresses 3D joint positions from a 2D input, and the discriminator aims to distinguish ground-truth 3D samples from predicted ones. We leverage Graph Convolutional Networks (GCNs) in both the generator and the discriminator to fully exploit the spatial relations of the input and output coordinates. The combination of GANs and GCNs encourages the network to predict more accurate 3D joint locations while learning more reasonable human body structures. We demonstrate the effectiveness of our approach on standard benchmark datasets (i.e., Human3.6M and HumanEva-I), where it outperforms state-of-the-art methods. Furthermore, we propose a new evaluation metric, the distance-based Pose Structure Score (dPSS), for evaluating the structural similarity between a predicted 3D pose and its ground truth.
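To make the GCN component of the abstract concrete, the following is a minimal sketch of graph convolution over a human skeleton, assuming the common symmetrically normalized formulation X' = D^{-1/2}(A + I)D^{-1/2} X W. The joint count (16), edge list, and layer sizes are illustrative, not taken from the paper.

```python
import numpy as np

NUM_JOINTS = 16
# Illustrative skeleton edges (parent, child); not the paper's exact topology.
EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6),
         (0, 7), (7, 8), (8, 9), (8, 10), (10, 11), (11, 12),
         (8, 13), (13, 14), (14, 15)]

def normalized_adjacency(num_joints, edges):
    """Symmetrically normalized adjacency with self-loops added."""
    a = np.eye(num_joints)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(x, w, a_norm):
    """One GCN layer: aggregate features over the skeleton, project, ReLU."""
    return np.maximum(a_norm @ x @ w, 0.0)

rng = np.random.default_rng(0)
a_norm = normalized_adjacency(NUM_JOINTS, EDGES)
x2d = rng.standard_normal((NUM_JOINTS, 2))   # 2D joint coordinates (input)
w1 = rng.standard_normal((2, 64)) * 0.1
w2 = rng.standard_normal((64, 3)) * 0.1
h = graph_conv(x2d, w1, a_norm)
x3d = a_norm @ h @ w2                        # regressed 3D joints (output)
print(x3d.shape)  # (16, 3)
```

In the generator role described above, such layers map the 2D joint graph to 3D positions; the discriminator can use the same layers followed by a pooled real/fake score.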

Highlights

  • Human pose estimation is widely applied in action recognition, behavior analysis, and human-computer interaction

  • We leverage Graph Convolutional Networks (GCNs) in both the generator and the discriminator to fully exploit the spatial relations of the input and output coordinates, which further boosts the performance of 3D human pose estimation

  • We propose the distance-based Pose Structure Score, a new evaluation metric for 3D human pose estimation that offers a fine-grained measure of structural similarity
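The highlights name dPSS but this page does not give its formula, so the sketch below is only one plausible reading of a "distance-based" structural score: it compares the pairwise inter-joint distance matrices of the predicted and ground-truth poses. All names and the scoring function are hypothetical, not the paper's definition.

```python
import numpy as np

def pairwise_distances(pose):
    """All inter-joint Euclidean distances of a (num_joints, 3) pose."""
    diff = pose[:, None, :] - pose[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def structure_score(pred, gt, eps=1e-8):
    """Hypothetical score in (0, 1]; 1.0 when distance structures match."""
    dp, dg = pairwise_distances(pred), pairwise_distances(gt)
    err = np.abs(dp - dg).mean()        # mean structural discrepancy
    scale = dg.mean() + eps             # normalize by body scale
    return 1.0 / (1.0 + err / scale)

gt = np.random.default_rng(1).standard_normal((16, 3))
print(structure_score(gt, gt))  # identical poses score exactly 1.0
```

Unlike per-joint position error, a score of this form penalizes implausible bone-length and limb-proportion changes even when individual joints are close to their targets.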


Summary

Introduction

Human pose estimation is widely applied in action recognition, behavior analysis, and human-computer interaction. Although great progress has been made in 2D human pose estimation, 3D human pose estimation from a monocular RGB image remains challenging due to depth ambiguity. Deep Convolutional Neural Networks (DCNNs) have made remarkable achievements in 3D pose estimation; [1] shows that DCNNs can be employed for monocular 3D pose estimation by regressing 3D joint locations directly from the image. However, dense pixel information makes such locations difficult to regress, and the results are often unsatisfactory. With the rapid development of deep learning, and since 3D coordinates are difficult to annotate, recent approaches solve this problem using a two-step framework [2], [3].

The associate editor coordinating the review of this manuscript and approving it for publication was Jonghoon Kim.
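The two-step framework mentioned above can be sketched as follows: step 1 detects 2D joints in the image, and step 2 lifts them to 3D. Both steps are stubbed here (random detector, tiny MLP lifter) purely so the data flow is concrete; the function names and sizes are illustrative, not the cited methods.

```python
import numpy as np

def detect_2d_joints(image, num_joints=16):
    """Stand-in for step 1, a 2D pose detector (e.g. a heatmap CNN)."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    return rng.uniform([0, 0], [w, h], size=(num_joints, 2))

def lift_to_3d(joints_2d, w1, w2):
    """Stand-in for step 2, a lifting network on flattened 2D coordinates."""
    h = np.maximum(joints_2d.reshape(-1) @ w1, 0.0)
    return (h @ w2).reshape(-1, 3)

image = np.zeros((256, 256, 3))          # placeholder RGB input
j2d = detect_2d_joints(image)
rng = np.random.default_rng(1)
w1 = rng.standard_normal((32, 128)) * 0.05
w2 = rng.standard_normal((128, 48)) * 0.05
j3d = lift_to_3d(j2d, w1, w2)
print(j2d.shape, j3d.shape)  # (16, 2) (16, 3)
```

Decoupling the steps lets the 2D detector train on abundant in-the-wild 2D annotations, which is why the difficulty of annotating 3D coordinates motivates this design.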

