Although 3D human pose estimation has recently made strides, it remains difficult to accurately reconstruct a 3D human pose from a single image without 3D annotations, for two reasons. First, reconstruction is inherently ambiguous: multiple 3D poses can project onto the same 2D pose. Second, estimating camera rotation without laborious camera calibration is difficult. While some approaches address these issues with traditional computer vision algorithms, those algorithms are not differentiable and cannot be optimized through training. This paper introduces two modules that explicitly leverage geometry to overcome these challenges, without requiring any 3D ground truth or camera parameters. The first, a relative depth estimation module, mitigates depth ambiguity by narrowing the possible depths of each joint to only two candidates. The second, a differentiable pose alignment module, computes camera rotation by aligning poses estimated from different views. These geometrically interpretable modules reduce training complexity and yield superior performance. Our method achieves state-of-the-art results on standard benchmark datasets, surpassing other self-supervised methods and even outperforming several fully supervised approaches that rely heavily on 3D annotations.
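As a rough illustration of the two geometric ideas described above, and not the paper's actual implementation, the sketch below shows (a) how, under an assumed orthographic camera with a known bone length, the relative depth of a child joint with respect to its parent reduces to two sign-symmetric candidates, and (b) how a camera rotation can be recovered differentiably by aligning two corresponding 3D poses with the Kabsch algorithm (an SVD-based rigid alignment). The function names and the orthographic-projection assumption are ours.

```python
import torch


def relative_depth_candidates(joint_2d, parent_2d, bone_length):
    """Return the two possible depth offsets of a child joint w.r.t. its parent.

    Under an orthographic camera, the squared bone length splits into an
    in-plane part and a depth part, so the depth offset is +/- the square root
    of the remainder. Hypothetical helper, not the paper's exact module.
    """
    planar_sq = ((joint_2d - parent_2d) ** 2).sum(dim=-1)           # ||delta_xy||^2
    depth_sq = torch.clamp(bone_length ** 2 - planar_sq, min=0.0)   # L^2 - ||delta_xy||^2
    depth = torch.sqrt(depth_sq)
    return torch.stack([depth, -depth], dim=-1)                     # two candidates


def align_poses(pose_a, pose_b):
    """Differentiable rigid alignment (Kabsch via SVD).

    pose_a, pose_b: (J, 3) tensors of corresponding 3D joints from two views.
    Returns a 3x3 rotation R such that R @ pose_a[i] approximates pose_b[i];
    gradients flow through torch.linalg.svd, so the step is trainable.
    """
    a = pose_a - pose_a.mean(dim=0, keepdim=True)   # center both point sets
    b = pose_b - pose_b.mean(dim=0, keepdim=True)
    h = a.T @ b                                      # 3x3 cross-covariance
    u, _, vt = torch.linalg.svd(h)
    # Fix the sign so the result is a proper rotation (det = +1), not a reflection.
    d = torch.sign(torch.det(vt.T @ u.T))
    s = torch.eye(3, dtype=pose_a.dtype, device=pose_a.device)
    s[2, 2] = d
    return vt.T @ s @ u.T
```

In this sketch, the two depth candidates per joint and the SVD-based rotation are the only geometric operations; how candidates are selected and how the alignment is used as a training signal is left to the full method.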