Abstract

In this work we tackle the inherently ambiguous task of predicting 3D human poses from monocular RGB images. Two different approaches to this goal are presented. First, we propose to train a fully connected neural network that lifts 2D joint positions, which can be obtained with any off-the-shelf 2D human pose estimation algorithm, to 3D poses. Since 3D human pose datasets are limited and the joint definitions of 2D and 3D human pose estimation datasets often do not match, we create a synthetic ground truth. In this way, our model can learn to lift arbitrary sets of keypoints to 3D. Our experiments show that we achieve competitive results on the Human3.6M dataset without using any of the Human3.6M training data. Second, we propose a new fully convolutional architecture that encodes 3D poses with composite fields. Our method learns 3D vectors that point from a central position of the human body to each of the human's joints in 3D space. This model achieves competitive results on the challenging 3D Poses in the Wild (3DPW) dataset. Furthermore, it runs at 21 FPS, which makes it real-time capable.
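To make the first approach concrete, the sketch below shows a fully connected lifting network in the spirit of the abstract: it maps a flattened vector of 2D keypoints to 3D joint positions. The joint count (17), hidden width (1024), dropout rate, and residual-block layout are illustrative assumptions and not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class LiftingNetwork(nn.Module):
    """Fully connected network that lifts 2D keypoints to 3D joint positions.

    Hypothetical sketch: layer sizes and the residual-block layout are
    assumptions chosen for illustration, not the paper's reported design.
    """

    def __init__(self, num_joints: int = 17, hidden: int = 1024):
        super().__init__()
        # Input is the flattened (x, y) coordinates of all joints.
        self.inp = nn.Linear(num_joints * 2, hidden)
        self.block = nn.Sequential(
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(0.5),
        )
        # Output is the flattened (x, y, z) coordinates of all joints.
        self.out = nn.Linear(hidden, num_joints * 3)

    def forward(self, kp2d: torch.Tensor) -> torch.Tensor:
        # kp2d: (batch, num_joints * 2) 2D detections from any off-the-shelf estimator.
        h = torch.relu(self.inp(kp2d))
        h = h + self.block(h)  # residual connection
        return self.out(h).view(-1, kp2d.shape[1] // 2, 3)


# Usage: lift a batch of eight 2D detections to 3D poses, shape (8, 17, 3).
model = LiftingNetwork()
poses_3d = model(torch.randn(8, 17 * 2))
```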
