SVMAC: Unsupervised 3D Human Pose Estimation from a Single Image with Single-view-multi-angle Consistency

Abstract

Recovering 3D human pose from 2D joints is still a challenging problem, especially without any 3D annotation, video information, or multi-view information. In this paper, we present an unsupervised GAN-based model consisting of multiple weight-sharing generators to estimate a 3D human pose from a single image without 3D annotations. In our model, we introduce single-view-multi-angle consistency (SVMAC) to significantly improve the estimation performance. With 2D joint locations as input, our model estimates a 3D pose and a camera simultaneously. During training, the estimated 3D pose is rotated by random angles and the estimated camera projects the rotated 3D poses back to 2D. The 2D reprojections are fed into weight-sharing generators to estimate the corresponding 3D poses and cameras, which are then mixed to impose SVMAC constraints that self-supervise the training process. The experimental results show that our method outperforms the state-of-the-art unsupervised methods on Human3.6M and MPI-INF-3DHP. Moreover, qualitative results on MPII and LSP show that our method can generalize well to unknown data.
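
The rotate-and-reproject step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not state the rotation axis or the camera model, so the rotation about the vertical (y) axis, the scaled orthographic camera, and the function names `rotate_about_vertical`, `project_orthographic`, and `consistency_error` are all assumptions made for illustration. The learned generator is replaced by a stand-in that returns the rotated pose unchanged.

```python
import numpy as np

def rotate_about_vertical(pose_3d, angle_rad):
    """Rotate a (J, 3) pose about the y axis (assumed to be the vertical axis)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    return pose_3d @ rot_y.T

def project_orthographic(pose_3d, scale=1.0):
    """Hypothetical camera: drop the depth coordinate and apply a scale."""
    return scale * pose_3d[:, :2]

def consistency_error(pose_a, pose_b):
    """Mean per-joint Euclidean distance between two (J, 3) poses."""
    return float(np.mean(np.linalg.norm(pose_a - pose_b, axis=1)))

rng = np.random.default_rng(0)
pose = rng.normal(size=(17, 3))      # stand-in for an estimated 3D pose
pose = pose - pose[0]                # centre the pose on its root joint

angle = rng.uniform(-np.pi, np.pi)   # random rotation angle
rotated = rotate_about_vertical(pose, angle)
reprojected_2d = project_orthographic(rotated)

# In training, a weight-sharing generator would lift reprojected_2d back to 3D.
# Here the rotated pose itself stands in for a perfect lifter: rotating it back
# should reproduce the original pose, which is the kind of agreement a
# multi-angle consistency constraint rewards.
recovered = rotate_about_vertical(rotated, -angle)
error = consistency_error(pose, recovered)
```

With a real generator in place of the stand-in, `error` would be nonzero and could serve as one self-supervision signal. How the paper actually mixes the estimated poses and cameras is not specified in the abstract.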

Similar Papers
  • Research Article
  • Citations: 13
  • 10.1109/tase.2023.3279928
Bi-Pose: Bidirectional 2D-3D Transformation for Human Pose Estimation From a Monocular Camera
  • Jul 1, 2024
  • IEEE Transactions on Automation Science and Engineering
  • Songlin Du + 3 more

Automatically estimating 3D human poses in video and inferring their meaning plays an essential role in many human-centered automation systems. Existing research has made remarkable progress by first estimating 2D human joints in video and then reconstructing the 3D human pose from the 2D joints. However, reconstructing 3D pose from 2D joints mono-directionally ignores the interaction between information in 3D space and 2D space and loses rich information from the original video, which limits the ceiling of estimation accuracy. To this end, this paper proposes a bidirectional 2D-3D transformation framework that exchanges 2D and 3D information bidirectionally and uses video information to estimate an offset for refining the 3D human pose. In addition, a bone-length stability loss is used to explore human body structure, making the estimated 3D pose more natural and further increasing the overall accuracy. In evaluation, the estimation error of the proposed method, measured by the mean per joint position error (MPJPE), is only 46.5 mm, which is much lower than that of state-of-the-art methods under the same experimental conditions. The improvement in accuracy will enable machines to better understand human poses for building superior human-centered automation systems.
Note to Practitioners: This paper was motivated by the demand of human-centered automation systems that need to accurately understand human poses. Existing approaches mainly focus on inferring 3D human pose from 2D joints mono-directionally. Although they have made remarkable contributions to estimating 3D human pose in such a mono-directional way, we found that they ignore the 2D-3D interaction and do not use the original video when inferring 3D pose from 2D joints. This paper therefore suggests a bidirectional 2D-3D transformation that exchanges 2D and 3D information and uses video information to estimate a more accurate 3D human pose for human-centered automation systems. This work is a pioneering attempt at interactively using 2D and 3D information for more accurate estimation of human pose. Benefiting from its state-of-the-art accuracy, the proposed approach is expected to make significant contributions to many human-centered automation systems, such as human-machine interaction, biomimetic manipulation, and automatic surveillance systems.
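
For readers unfamiliar with the metric quoted above, MPJPE is the Euclidean distance between each predicted joint and its ground-truth position, averaged over joints and frames. A minimal sketch follows. The function name `mpjpe` and the toy arrays are illustrative, and any alignment step that a given evaluation protocol applies before measuring is omitted.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error.

    pred, gt: arrays of shape (..., J, 3) in the same units (e.g. mm).
    Returns the Euclidean joint error averaged over all joints and frames.
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

# Toy data: 2 frames, 3 joints, every predicted joint offset by (3, 4, 0).
gt = np.zeros((2, 3, 3))
pred = gt.copy()
pred[..., 0] += 3.0
pred[..., 1] += 4.0

score = mpjpe(pred, gt)  # each joint is off by sqrt(3^2 + 4^2) = 5
```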

  • Research Article
  • Citations: 17
  • 10.1016/j.cag.2022.07.021
Enhancement of human 3D pose estimation using a novel concept of depth prediction with pose alignment from a single 2D image
  • Jul 26, 2022
  • Computers & Graphics
  • Mohit Kushwaha + 2 more


  • Research Article
  • Citations: 109
  • 10.1109/tpami.2019.2892452
3D Human Pose Machines with Self-Supervised Learning.
  • Jan 1, 2019
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Keze Wang + 4 more

Driven by recent computer vision and robotic applications, recovering 3D human poses has become increasingly important and has attracted growing interest. In fact, completing this task is quite challenging due to the diverse appearances, viewpoints, occlusions and inherent geometric ambiguities in monocular images. Most existing methods focus on designing elaborate priors/constraints to directly regress 3D human poses from the corresponding 2D human pose-aware features or 2D pose predictions. However, due to the insufficient 3D pose data for training and the domain gap between 2D space and 3D space, these methods have limited scalability for all practical scenarios (e.g., outdoor scenes). To address this issue, this paper proposes a simple yet effective self-supervised correction mechanism to learn all intrinsic structures of human poses from abundant images. Specifically, the proposed mechanism involves two dual learning tasks, i.e., the 2D-to-3D pose transformation and 3D-to-2D pose projection, to serve as a bridge between 3D and 2D human poses in a type of "free" self-supervision for accurate 3D human pose estimation. The 2D-to-3D pose transformation sequentially regresses intermediate 3D poses by transforming the pose representation from the 2D domain to the 3D domain under the sequence-dependent temporal context, while the 3D-to-2D pose projection refines the intermediate 3D poses by maintaining geometric consistency between the 2D projections of 3D poses and the estimated 2D poses. Therefore, these two dual learning tasks enable our model to adaptively learn from 3D human pose data and external large-scale 2D human pose data. We further apply our self-supervised correction mechanism to develop a 3D human pose machine, which jointly integrates the 2D spatial relationship, temporal smoothness of predictions and 3D geometric knowledge.
Extensive evaluations on the Human3.6M and HumanEva-I benchmarks demonstrate the superior performance and efficiency of our framework over all the compared competing methods.

  • Conference Article
  • Citations: 66
  • 10.1109/3dv.2016.84
3D Human Pose Estimation via Deep Learning from 2D Annotations
  • Oct 1, 2016
  • Ernesto Brau + 1 more

We propose a deep convolutional neural network for 3D human pose and camera estimation from monocular images that learns from 2D joint annotations. The proposed network follows the typical architecture, but contains an additional output layer which projects predicted 3D joints onto 2D, and enforces constraints on body part lengths in 3D. We further enforce pose constraints using an independently trained network that learns a prior distribution over 3D poses. We evaluate our approach on several benchmark datasets and compare against state-of-the-art approaches for 3D human pose estimation, achieving comparable performance. Additionally, we show that our approach significantly outperforms other methods in cases where 3D ground truth data is unavailable, and that our network exhibits good generalization properties.
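
A constraint on 3D body part lengths, as mentioned in this abstract, can be sketched as a penalty on the difference between predicted and reference bone lengths. The abstract does not give the form of the constraint, so the squared-error penalty, the five-joint chain in `BONES`, and the names `bone_lengths` and `bone_length_penalty` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical skeleton: a 5-joint chain given as (child, parent) index pairs.
BONES = [(1, 0), (2, 1), (3, 2), (4, 3)]

def bone_lengths(pose_3d, bones=BONES):
    """Return the length of each bone for a (J, 3) pose."""
    child = np.array([c for c, _ in bones])
    parent = np.array([p for _, p in bones])
    return np.linalg.norm(pose_3d[child] - pose_3d[parent], axis=1)

def bone_length_penalty(pose_3d, target_lengths, bones=BONES):
    """Mean squared deviation of bone lengths from reference lengths."""
    diff = bone_lengths(pose_3d, bones) - np.asarray(target_lengths)
    return float(np.mean(diff ** 2))

pose = np.array([[0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 2.0, 0.0],
                 [0.0, 2.0, 1.0],
                 [0.0, 2.0, 3.0]])

lengths = bone_lengths(pose)                     # three unit bones, one of length 2
penalty = bone_length_penalty(pose, np.ones(4))  # only the last bone deviates
```

In a network, a term like `penalty` would be added to the 2D reprojection loss so that poses that explain the 2D annotations but have implausible limb lengths are discouraged.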

  • Research Article
  • Citations: 31
  • 10.1016/j.patcog.2023.109497
Weakly-supervised pre-training for 3D human pose estimation via perspective knowledge
  • Mar 5, 2023
  • Pattern Recognition
  • Zhongwei Qiu + 3 more


  • Conference Article
  • Citations: 482
  • 10.1109/cvpr.2018.00551
3D Human Pose Estimation in the Wild by Adversarial Learning
  • Jun 1, 2018
  • Wei Yang + 5 more

Recently, remarkable advances have been achieved in 3D human pose estimation from monocular images because of the powerful Deep Convolutional Neural Networks (DCNNs). Despite their success on large-scale datasets collected in the constrained lab environment, it is difficult to obtain the 3D pose annotations for in-the-wild images. Therefore, 3D human pose estimation in the wild is still a challenge. In this paper, we propose an adversarial learning framework, which distills the 3D human pose structures learned from the fully annotated dataset to in-the-wild images with only 2D pose annotations. Instead of defining hard-coded rules to constrain the pose estimation results, we design a novel multi-source discriminator to distinguish the predicted 3D poses from the ground-truth, which helps to enforce the pose estimator to generate anthropometrically valid poses even with images in the wild. We also observe that a carefully designed information source for the discriminator is essential to boost the performance. Thus, we design a geometric descriptor, which computes the pairwise relative locations and distances between body joints, as a new information source for the discriminator. The efficacy of our adversarial learning framework with the new geometric descriptor has been demonstrated through extensive experiments on widely used public benchmarks. Our approach significantly improves the performance compared with previous state-of-the-art approaches.
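
The geometric descriptor mentioned above is described as the pairwise relative locations and distances between body joints. A minimal NumPy sketch of those two quantities is shown below. The function name `geometric_descriptor` and the output layout are assumptions, and the paper may normalise or arrange these features differently before passing them to the discriminator.

```python
import numpy as np

def geometric_descriptor(joints):
    """Pairwise relative locations and distances between joints.

    joints: (J, D) array of joint coordinates.
    Returns offsets of shape (J, J, D), where offsets[i, j] = joints[i] - joints[j],
    and distances of shape (J, J).
    """
    offsets = joints[:, None, :] - joints[None, :, :]
    distances = np.linalg.norm(offsets, axis=-1)
    return offsets, distances

joints = np.array([[0.0, 0.0, 0.0],
                   [3.0, 4.0, 0.0],
                   [0.0, 0.0, 2.0]])
offsets, distances = geometric_descriptor(joints)
```

The distance matrix is symmetric with a zero diagonal, so an implementation feeding it to a network could keep only the upper triangle.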

  • Research Article
  • Citations: 14
  • 10.1007/s11263-023-01749-2
Lifting 2D Human Pose to 3D with Domain Adapted 3D Body Concept
  • Feb 3, 2023
  • International Journal of Computer Vision
  • Qiang Nie + 2 more

Lifting the 2D human pose to the 3D pose is an important yet challenging task. Existing 3D human pose estimation suffers from (1) the inherent ambiguity between the 2D and 3D data, and (2) the lack of well-labeled 2D–3D pose pairs in the wild. Human beings are able to imagine the 3D human pose from a 2D image or a set of 2D body key-points with the least ambiguity, which should be attributed to the prior knowledge of the human body that we have acquired in our mind. Inspired by this, we propose a new framework that leverages the labeled 3D human poses to learn a 3D concept of the human body to reduce ambiguity. To have consensus on the body concept from the 2D pose, our key insight is to treat the 2D human pose and the 3D human pose as two different domains. By adapting the two domains, the body knowledge learned from 3D poses is applied to 2D poses and guides the 2D pose encoder to generate informative 3D “imagination” as an embedding in pose lifting. Benefiting from the domain adaptation perspective, the proposed framework unifies the supervised and semi-supervised 3D pose estimation in a principled framework. Extensive experiments demonstrate that the proposed approach can achieve state-of-the-art performance on standard benchmarks. More importantly, it is validated that the explicitly learned 3D body concept effectively alleviates the 2D–3D ambiguity, improves the generalization, and enables the network to leverage the abundant unlabeled 2D data.

  • Research Article
  • 10.1016/j.jer.2025.07.007
An efficient baseline for multi-view 3d human pose estimation
  • Aug 1, 2025
  • Journal of Engineering Research
  • Guozheng Peng + 1 more


  • Research Article
  • Citations: 26
  • 10.1016/j.imavis.2025.105437
Markerless multi-view 3D human pose estimation: A survey
  • Mar 1, 2025
  • Image and Vision Computing
  • Ana Filipa Rodrigues Nogueira + 2 more

3D human pose estimation aims to reconstruct the human skeleton of all the individuals in a scene by detecting several body joints. The creation of accurate and efficient methods is required for several real-world applications including animation, human–robot interaction, surveillance systems or sports, among many others. However, several obstacles such as occlusions, random camera perspectives, or the scarcity of 3D labelled data have been hampering the models’ performance and limiting their deployment in real-world scenarios. The higher availability of cameras has led researchers to explore multi-view solutions, which can exploit different perspectives to reconstruct the pose. Most existing reviews focus mainly on monocular 3D human pose estimation, and a comprehensive survey covering only multi-view approaches to determining the 3D pose has been missing since 2012. Thus, the goal of this survey is to fill that gap: to present an overview of the methodologies related to 3D pose estimation in multi-view settings, to understand the strategies used to address the various challenges, and to identify their limitations. According to the reviewed articles, most methods are fully-supervised approaches based on geometric constraints. Nonetheless, most of the methods suffer from 2D pose mismatches. Incorporating temporal consistency and depth information has been suggested to reduce the impact of this limitation, and working directly with 3D features can avoid the problem entirely, but at the expense of higher computational complexity. Models with lower supervision levels were identified to overcome some of the issues related to 3D pose, particularly the scarcity of labelled datasets. Therefore, no method is yet capable of solving all the challenges associated with the reconstruction of the 3D pose.
Due to the existing trade-off between complexity and performance, the best method depends on the application scenario. Therefore, further research is still required to develop an approach capable of quickly inferring a highly accurate 3D pose with bearable computation cost. To this goal, techniques such as active learning, methods that learn with a low level of supervision, the incorporation of temporal consistency, view selection, estimation of depth information and multi-modal approaches might be interesting strategies to keep in mind when developing a new methodology to solve this task.
Highlights:
  • First review only on multi-view, multi-modal methods to estimate 3D pose since 2012.
  • Multi-view allows capturing the full body geometry, making 3D pose estimation easier.
  • Real-world applications include sports, broadcasting, rehabilitation or animation.
  • Finding a fast, accurate method with low computational cost remains a challenge.
  • Multi-modal methods or view selection can lead to an efficient and effective model.

  • Research Article
  • Citations: 50
  • 10.1109/tvcg.2020.2973076
Weakly Supervised Adversarial Learning for 3D Human Pose Estimation from Point Clouds.
  • Feb 13, 2020
  • IEEE Transactions on Visualization and Computer Graphics
  • Zihao Zhang + 3 more

Point cloud-based 3D human pose estimation, which aims to recover the 3D locations of human skeleton joints, plays an important role in many AR/VR applications. The success of existing methods is generally built upon large-scale data annotated with 3D human joints. However, annotating 3D human joints from input depth images or point clouds is a labor-intensive and error-prone process, due to the self-occlusion between body parts as well as the tedious annotation process on 3D point clouds. Meanwhile, it is easier to construct human pose datasets with 2D human joint annotations on depth images. To address this problem, we present a weakly supervised adversarial learning framework for 3D human pose estimation from point clouds. Compared to existing 3D human pose estimation methods from depth images or point clouds, we exploit both weakly supervised data with only annotations of 2D human joints and fully supervised data with annotations of 3D human joints. In order to relieve the human pose ambiguity due to weak supervision, we adopt adversarial learning to ensure the recovered human pose is valid. Instead of using either 2D or 3D representations of depth images as in previous methods, we exploit both point clouds and the input depth image. We adopt a 2D CNN to extract 2D human joints from the input depth image. The 2D human joints aid us in obtaining the initial 3D human joints and in selecting effective sampling points, which reduces the computation cost of 3D human pose regression with the point cloud network. The point cloud network can narrow the domain gap between the network input, i.e. point clouds, and 3D joints. Thanks to the weakly supervised adversarial learning framework, our method can achieve accurate 3D human pose estimation from point clouds. Experiments on the ITOP dataset and EVAL dataset demonstrate that our method can achieve state-of-the-art performance efficiently.

  • Research Article
  • Citations: 12
  • 10.1016/j.cviu.2023.103715
PoseGU: 3D human pose estimation with novel human pose generator and unbiased learning
  • May 13, 2023
  • Computer Vision and Image Understanding
  • Shannan Guan + 3 more


  • Conference Article
  • Citations: 39
  • 10.1109/3dv50981.2020.00041
PoseNet3D: Learning Temporally Consistent 3D Human Pose via Knowledge Distillation
  • Nov 1, 2020
  • Shashank Tripathi + 3 more

Recovering 3D human pose from 2D joints is a highly unconstrained problem. We propose a novel neural network framework, PoseNet3D, that takes 2D joints as input and outputs 3D skeletons and SMPL body model parameters. By casting our learning approach in a student-teacher framework, we avoid using any 3D data such as paired/unpaired 3D data, motion capture sequences, depth images or multi-view images during training. We first train a teacher network that outputs 3D skeletons, using only 2D poses for training. The teacher network distills its knowledge to a student network that predicts 3D pose in SMPL representation. Finally, both the teacher and the student networks are jointly fine-tuned in an end-to-end manner using temporal, self-consistency and adversarial losses, improving the accuracy of each individual network. Results on Human3.6M dataset for 3D human pose estimation demonstrate that our approach reduces the 3D joint prediction error by 18% compared to previous unsupervised methods. Qualitative results on in-the-wild datasets show that the recovered 3D poses and meshes are natural, realistic, and flow smoothly over consecutive frames.

  • Research Article
  • Citations: 14
  • 10.1016/j.cviu.2021.103278
Monocular 3D multi-person pose estimation via predicting factorized correction factors
  • Sep 28, 2021
  • Computer Vision and Image Understanding
  • Yu Guo + 4 more


  • Research Article
  • Citations: 29
  • 10.1109/tmm.2022.3158068
Quantification of Occlusion Handling Capability of a 3D Human Pose Estimation Framework
  • Jan 1, 2023
  • IEEE Transactions on Multimedia
  • Mehwish Ghafoor + 1 more

3D human pose estimation using monocular images is an important yet challenging task. Existing 3D pose detection methods exhibit excellent performance under normal conditions; however, their performance may degrade due to occlusion. Recently, some occlusion-aware methods have also been proposed; however, the occlusion handling capability of these networks has not yet been thoroughly investigated. In the current work, we propose an occlusion-guided 3D human pose estimation framework and quantify its occlusion handling capability by using different protocols. The proposed method estimates more accurate 3D human poses using 2D skeletons with missing joints as input. Missing joints are handled by introducing occlusion guidance that provides extra information about the absence or presence of a joint. Temporal information has also been exploited to better estimate the missing joints. A large number of experiments are performed to quantify the occlusion handling capability of the proposed method on three publicly available datasets in various settings, including random missing joints, fixed body parts missing, and complete frames missing, using the mean per joint position error criterion. In addition, the quality of the predicted 3D poses is evaluated using action classification performance as a criterion. 3D poses estimated by the proposed method achieved significantly improved action recognition performance in the presence of missing joints. Our experiments demonstrate the effectiveness of the proposed framework for handling missing joints as well as for quantifying the occlusion handling capability of deep neural networks.
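
One simple way to give a network "extra information about the absence or presence of a joint" is to append a binary visibility flag to each 2D joint and zero out the coordinates of missing joints. The sketch below shows that encoding. It is an assumption about how such guidance could be represented, not the encoding used in the paper, and the function name `with_occlusion_guidance` is hypothetical.

```python
import numpy as np

def with_occlusion_guidance(joints_2d, visible):
    """Attach a presence flag to each 2D joint.

    joints_2d: (J, 2) float array of joint coordinates.
    visible:   length-J boolean sequence, False where a joint is missing.
    Returns a (J, 3) array of (x, y, flag) with missing joints zeroed.
    """
    visible = np.asarray(visible, dtype=bool)
    masked = np.where(visible[:, None], joints_2d, 0.0)
    flags = visible[:, None].astype(masked.dtype)
    return np.concatenate([masked, flags], axis=1)

joints = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])
guided = with_occlusion_guidance(joints, [True, False, True])
```

The flag lets a model distinguish a joint that is genuinely located at the origin from one that was not detected.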

  • Conference Article
  • Citations: 38
  • 10.1109/cvpr52688.2022.00652
ElePose: Unsupervised 3D Human Pose Estimation by Predicting Camera Elevation and Learning Normalizing Flows on 2D Poses
  • Jun 1, 2022
  • Bastian Wandt + 2 more

Human pose estimation from single images is a challenging problem that is typically solved by supervised learning. Unfortunately, labeled training data does not yet exist for many human activities since 3D annotation requires dedicated motion capture systems. Therefore, we propose an unsupervised approach that learns to predict a 3D human pose from a single image while only being trained with 2D pose data, which can be crowd-sourced and is already widely available. To this end, we estimate the 3D pose that is most likely over random projections, with the likelihood estimated using normalizing flows on 2D poses. While previous work requires strong priors on camera rotations in the training data set, we learn the distribution of camera angles which significantly improves the performance. Another part of our contribution is to stabilize training with normalizing flows on high-dimensional 3D pose data by first projecting the 2D poses to a linear subspace. We outperform the state-of-the-art unsupervised human pose estimation methods on the benchmark datasets Human3.6M and MPI-INF-3DHP in many metrics.
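
Projecting 2D poses to a linear subspace, as mentioned in this abstract, can be illustrated with principal component analysis computed by SVD. The abstract does not say how the subspace is obtained, so the use of PCA, the subspace dimension, and the names `fit_pca`, `to_subspace`, and `from_subspace` are assumptions for illustration. The normalizing flow that would operate on the reduced codes is not shown.

```python
import numpy as np

def fit_pca(poses_flat, n_components):
    """Fit a linear subspace to flattened poses of shape (N, F)."""
    mean = poses_flat.mean(axis=0)
    _, _, vt = np.linalg.svd(poses_flat - mean, full_matrices=False)
    return mean, vt[:n_components]          # basis rows are orthonormal

def to_subspace(x, mean, basis):
    """Map (N, F) poses to (N, n_components) subspace coordinates."""
    return (x - mean) @ basis.T

def from_subspace(z, mean, basis):
    """Map subspace coordinates back to (N, F) poses."""
    return z @ basis + mean

# Synthetic stand-in for flattened 2D poses: 17 joints x 2 coordinates = 34
# features, generated from 4 latent factors so a 4-D subspace suffices.
rng = np.random.default_rng(0)
poses = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 34))

mean, basis = fit_pca(poses, n_components=4)
codes = to_subspace(poses, mean, basis)
reconstructed = from_subspace(codes, mean, basis)
```

On real pose data the reconstruction would be approximate, and the number of components trades off compactness against lost detail.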
