Abstract

To accurately estimate 3D human pose from monocular camera images, a large amount of 3D annotated data is required. However, obtaining 3D annotated data outside the laboratory is difficult. In the absence of such data, weakly-supervised methods that rely on multi-view cameras during training and single-view cameras during inference have been proposed. These methods use either multi-view networks or classical triangulation to train the 3D human pose estimator. This study shows that these two paradigms can be combined to further improve performance. The available unlabeled, uncalibrated multi-view inputs are used to obtain pseudo-3D labels via classical triangulation. A pose estimator is trained with these pseudo-3D labels and with a multi-view re-projection loss. This loss enforces consistency among the 3D poses estimated from different views and improves performance. Our method thus relaxes the usual constraints (calibrated cameras, 2D/3D annotations): it only requires multi-view videos for training, which makes it convenient for in-the-wild settings. The proposed method outperforms previous works on two challenging datasets, Human3.6M and MPI-INF-3DHP. Code and pretrained models will be made publicly available.
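
For concreteness, one plausible form of the training objective described above is sketched below; the notation ($\hat{P}_v$, $P^{\text{tri}}$, $\pi_u$, $x_u$, $\lambda$) is ours and not taken from the paper. The pose estimator's prediction from each view is supervised by the triangulated pseudo-3D label, and a re-projection term ties the predictions from all views together:

\[
\mathcal{L} \;=\; \sum_{v=1}^{V} \big\| \hat{P}_v - P^{\text{tri}} \big\|_2^{2}
\;+\; \lambda \sum_{v=1}^{V} \sum_{u=1}^{V} \big\| \pi_u\!\left(\hat{P}_v\right) - x_u \big\|_2^{2},
\]

where $\hat{P}_v$ is the 3D pose estimated from view $v$, $P^{\text{tri}}$ is the pseudo-3D label obtained by triangulating the 2D detections across the $V$ views, $\pi_u$ projects a 3D pose into view $u$, $x_u$ are the 2D joint detections in that view, and $\lambda$ balances the pseudo-label term against the multi-view re-projection consistency term.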
