
View Invariant 3D Human Pose Estimation

Abstract

The recent success of neural networks has significantly advanced the performance of 3D human pose estimation from 2D input images. However, the diversity of capturing viewpoints and the flexibility of human poses remain significant challenges. In this paper, we propose a view-invariant 3D human pose estimation module to alleviate the effects of viewpoint diversity. The proposed framework consists of a base network, which provides an initial estimate of the 3D pose; a view-invariant hierarchical correction network (VI-HC) on top of it, which learns 3D pose refinement under consistent views; and a view-invariant discriminative network (VID), which enforces high-level constraints over body configurations. In VI-HC, the initial 3D pose inputs are automatically transformed to consistent views for further refinement at the global-body level and the local-body-part level, respectively. For the VID, under consistent viewpoints, we use adversarial learning to differentiate between estimated 3D poses and real 3D poses to avoid implausible results. The experimental results demonstrate that the constraint on viewpoint consistency can dramatically enhance the performance of 3D human pose estimation. Our module is robust across different 3D pose base networks and achieves a significant improvement (about 9%) over a powerful baseline on the public 3D pose estimation benchmark Human3.6M.
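
To make the idea of a "consistent view" concrete, here is a minimal sketch in plain Python. It is not the paper's method: the abstract says the transformation is performed automatically by the network, whereas this example uses a hand-crafted rule (rotate about the vertical axis until the hip line points along +x) purely for illustration. The function name, the joint indices, and the choice of the y axis as "up" are all assumptions.

```python
import math

def to_canonical_view(pose, left_hip, right_hip):
    """Rotate a 3D pose about the vertical (y) axis so the hip line points along +x.

    pose: list of (x, y, z) joint coordinates.
    left_hip, right_hip: indices of the hip joints (hypothetical joint layout).
    Illustrative stand-in only; the paper's transformation is learned.
    """
    lx, _, lz = pose[left_hip]
    rx, _, rz = pose[right_hip]
    # Heading angle of the left-hip -> right-hip vector in the x-z (ground) plane.
    theta = math.atan2(rz - lz, rx - lx)
    c, s = math.cos(theta), math.sin(theta)
    # Rotate every joint by -theta in the x-z plane; y (height) is unchanged.
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in pose]

# Usage: a three-joint toy pose whose hip line initially points along +z.
toy_pose = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0), (1.0, 1.0, 1.0)]
canonical = to_canonical_view(toy_pose, left_hip=0, right_hip=1)
```

After this step, two captures of the same pose that differ only by camera yaw map to the same coordinates, which is the property that a refinement stage operating "under consistent views" would rely on.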

Similar Papers
  • Research Article
  • Cite count: 5
  • 10.1109/access.2020.3013917
An Adaptive Viewpoint Transformation Network for 3D Human Pose Estimation
  • Jan 1, 2020
  • IEEE Access
  • Guoqiang Liang + 3 more

Human pose estimation from a monocular image has attracted considerable interest because of its potential applications in many areas. The performance of 2D human pose estimation has improved substantially with the emergence of deep convolutional neural networks. In contrast, recovering a 3D human pose from a 2D pose remains a challenging problem. Currently, most methods try to learn a universal mapping that can be applied to all human poses from any viewpoint. However, due to the large variety of human poses and camera viewpoints, it is very difficult to learn such a universal mapping from current 3D pose estimation datasets. Instead of learning a universal mapping, we propose to learn an adaptive viewpoint transformation module, which transforms the 2D human pose to a more suitable viewpoint for recovering the 3D human pose. Specifically, our transformation module takes a 2D pose as input and predicts the transformation parameters. Rather than relying on hand-crafted criteria, this module is learned directly from the datasets and depends on the input 2D pose in the testing phase. The 3D pose is then recovered from this transformed 2D pose. Since 3D pose recovery becomes less difficult, we can obtain more accurate estimation results. Experiments on the Human3.6M and MPII datasets show that the proposed adaptive viewpoint transformation can improve the performance of 3D human pose estimation.

  • Research Article
  • Cite count: 26
  • 10.1016/j.imavis.2025.105437
Markerless multi-view 3D human pose estimation: A survey
  • Mar 1, 2025
  • Image and Vision Computing
  • Ana Filipa Rodrigues Nogueira + 2 more

3D human pose estimation aims to reconstruct the human skeleton of all the individuals in a scene by detecting several body joints. Accurate and efficient methods are required for several real-world applications, including animation, human–robot interaction, surveillance systems and sports, among many others. However, several obstacles, such as occlusions, random camera perspectives, and the scarcity of 3D labelled data, have been hampering the models’ performance and limiting their deployment in real-world scenarios. The higher availability of cameras has led researchers to explore multi-view solutions, which can exploit different perspectives to reconstruct the pose. Most existing reviews focus mainly on monocular 3D human pose estimation, and a comprehensive survey devoted only to multi-view approaches for determining the 3D pose has been missing since 2012. Thus, the goal of this survey is to fill that gap: to present an overview of the methodologies related to 3D pose estimation in multi-view settings, to understand the strategies used to address the various challenges, and to identify their limitations. According to the reviewed articles, most methods are fully-supervised approaches based on geometric constraints. Nonetheless, most of the methods suffer from 2D pose mismatches. Incorporating temporal consistency and depth information has been suggested to reduce the impact of this limitation, and working directly with 3D features can avoid the problem entirely, but at the expense of higher computational complexity. Models with lower supervision levels were identified as a way to overcome some of the issues related to 3D pose, particularly the scarcity of labelled datasets. Therefore, no method is yet capable of solving all the challenges associated with the reconstruction of the 3D pose.
Due to the existing trade-off between complexity and performance, the best method depends on the application scenario. Therefore, further research is still required to develop an approach capable of quickly inferring a highly accurate 3D pose with bearable computation cost. To this goal, techniques such as active learning, methods that learn with a low level of supervision, the incorporation of temporal consistency, view selection, estimation of depth information and multi-modal approaches might be interesting strategies to keep in mind when developing a new methodology to solve this task.
  • First review only on multi-view, multi-modal methods to estimate 3D pose since 2012.
  • Multi-view allows capturing the full body geometry, making 3D pose estimation easier.
  • Real-world applications include sports, broadcasting, rehabilitation or animation.
  • Finding a fast, accurate method with low computational cost remains a challenge.
  • Multi-modal methods or view selection can lead to an efficient and effective model.

  • Research Article
  • Cite count: 13
  • 10.1016/j.media.2024.103208
A self-supervised spatio-temporal attention network for video-based 3D infant pose estimation
  • May 18, 2024
  • Medical Image Analysis
  • Wang Yin + 9 more

  • Research Article
  • Cite count: 12
  • 10.1016/j.cviu.2023.103715
PoseGU: 3D human pose estimation with novel human pose generator and unbiased learning
  • May 13, 2023
  • Computer Vision and Image Understanding
  • Shannan Guan + 3 more

  • Research Article
  • Cite count: 31
  • 10.1016/j.patcog.2023.109497
Weakly-supervised pre-training for 3D human pose estimation via perspective knowledge
  • Mar 5, 2023
  • Pattern Recognition
  • Zhongwei Qiu + 3 more

  • Research Article
  • Cite count: 34
  • 10.1007/s11263-018-1071-9
Image-Based Synthesis for Deep 3D Human Pose Estimation
  • Mar 19, 2018
  • International Journal of Computer Vision
  • Grégory Rogez + 1 more

This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D motion capture data. Given a candidate 3D pose, our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a K-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms most of the published works in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for real-world images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images. Compared to data generated from more classical rendering engines, our synthetic images do not require any domain adaptation or fine-tuning stage.
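
The K-way classification framing above needs a class label for every training pose. A minimal sketch of one way to derive such labels is nearest-center assignment, shown below in plain Python. The abstract does not say how the clusters are built or how labels are assigned, so the function name, the squared Euclidean distance, and the assumption that class centers already exist (for example from a k-means run) are illustrative choices rather than the authors' procedure.

```python
def nearest_pose_class(pose, centers):
    """Return the index of the class center closest to `pose`.

    pose: list of (x, y, z) joints.
    centers: list of K poses with the same joint layout (assumed to come
             from a prior clustering step that is not shown here).
    """
    def sq_dist(a, b):
        # Squared Euclidean distance summed over all joints and coordinates.
        return sum((u - v) ** 2 for ja, jb in zip(a, b) for u, v in zip(ja, jb))

    return min(range(len(centers)), key=lambda k: sq_dist(pose, centers[k]))

# Usage: two single-joint class centers and one query pose.
centers = [[(0.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)]]
label = nearest_pose_class([(8.0, 1.0, 0.0)], centers)
```

With labels of this kind, a classifier with K outputs can be trained on images, and a predicted class index can be mapped back to its center pose.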

  • Research Article
  • Cite count: 22
  • 10.1016/j.cviu.2018.02.004
2D–3D pose consistency-based conditional random fields for 3D human pose estimation
  • Feb 9, 2018
  • Computer Vision and Image Understanding
  • Ju Yong Chang + 1 more

  • Research Article
  • Cite count: 14
  • 10.1007/s11263-023-01749-2
Lifting 2D Human Pose to 3D with Domain Adapted 3D Body Concept
  • Feb 3, 2023
  • International Journal of Computer Vision
  • Qiang Nie + 2 more

Lifting the 2D human pose to the 3D pose is an important yet challenging task. Existing 3D human pose estimation suffers from (1) the inherent ambiguity between the 2D and 3D data, and (2) the lack of well-labeled 2D–3D pose pairs in the wild. Human beings are able to imagine the 3D human pose from a 2D image or a set of 2D body key-points with the least ambiguity, which should be attributed to the prior knowledge of the human body that we have acquired in our mind. Inspired by this, we propose a new framework that leverages the labeled 3D human poses to learn a 3D concept of the human body to reduce ambiguity. To have consensus on the body concept from the 2D pose, our key insight is to treat the 2D human pose and the 3D human pose as two different domains. By adapting the two domains, the body knowledge learned from 3D poses is applied to 2D poses and guides the 2D pose encoder to generate informative 3D “imagination” as an embedding in pose lifting. Benefiting from the domain adaptation perspective, the proposed framework unifies the supervised and semi-supervised 3D pose estimation in a principled framework. Extensive experiments demonstrate that the proposed approach can achieve state-of-the-art performance on standard benchmarks. More importantly, it is validated that the explicitly learned 3D body concept effectively alleviates the 2D–3D ambiguity, improves the generalization, and enables the network to leverage the abundant unlabeled 2D data.

  • Research Article
  • Cite count: 7
  • 10.1371/journal.pone.0264302
LHPE-nets: A lightweight 2D and 3D human pose estimation model with well-structural deep networks and multi-view pose sample simplification method
  • Feb 23, 2022
  • PLoS ONE
  • Hao Wang + 3 more

Cross-view 3D human pose estimation models have made significant progress: through multi-view fusion, they better accomplish human joint localization and skeleton modeling in 3D. The multi-view 2D pose estimation part of such a model is very important, but its training cost is also very high. It uses deep learning networks to generate heatmaps for each view. Therefore, in this article, we tested several recent deep learning networks on pose estimation tasks, including MobileNetV2, MobileNetV3, EfficientNetV2 and ResNet. Then, based on the performance and drawbacks of these networks, we built multiple deep learning networks with better performance. We call our networks LHPE-nets; they mainly comprise the Low-Span network and the RDNS network. LHPE-nets use a network structure with evenly distributed channels, inverted residuals, external residual blocks and a framework for processing small-resolution samples to reach training saturation faster. We also designed a static pose sample simplification method for 3D pose data, which provides low-cost sample storage and makes the samples convenient for models to read. In the experiments, we used several recent models and two public evaluation metrics. The experimental results show the advantages of this work in fast start-up and network size: it is about 1-5 epochs faster than ResNet-34 during training. They also show improved accuracy in estimating individual joints, with better estimates for approximately 60% of the joints. Its performance in overall human pose estimation exceeds that of other networks by more than 7 mm. The experiments analyze in detail the network size, fast start-up, and 2D and 3D pose estimation performance of the proposed model. Compared with other pose estimation models, its performance has also reached a level more suitable for practical application.

  • Conference Article
  • Cite count: 6
  • 10.1109/icpr48806.2021.9412348
A Multi-Task Neural Network for Action Recognition with 3D Key-Points
  • Jan 10, 2021
  • Rongxiao Tang + 2 more

Action recognition and 3D human pose estimation are fundamental problems in computer vision and closely related areas. In this work, we propose a multi-task neural network for action recognition and 3D human pose estimation. Results of previous methods are usually error-prone, especially when tested against images taken in the wild, leading to erroneous results in action recognition. To solve this problem, we propose a principled approach to generate high-quality 3D pose ground truth given any in-the-wild image with a person inside. We achieve this by first devising a novel stereo-inspired neural network to directly map any 2D pose to a high-quality 3D counterpart. Based on the high-quality 3D labels, we carefully design the multi-task framework for action recognition and 3D human pose estimation. The proposed architecture can utilize shallow and deep image features together with in-the-wild 3D human key-points to guide a more precise result. High-quality 3D key-points can fully reflect morphological features of motions and thus boost the performance on action recognition. Experimental results demonstrate that 3D pose estimation leads to significantly higher performance on action recognition than separated learning. We also evaluate the generalization ability of our method both quantitatively and qualitatively. The proposed architecture performs favorably against the baseline 3D pose estimation methods. In addition, the reported results on the Penn Action and NTU datasets demonstrate the effectiveness of our method on the action recognition task.

  • Research Article
  • Cite count: 4
  • 10.1049/iet-cvi.2019.0089
3D driver pose estimation based on joint 2D–3D network
  • Jan 29, 2020
  • IET Computer Vision
  • Zhijie Yao + 5 more

Three‐dimensional (3D) driver pose estimation is a promising and challenging problem for computer–human interaction. Recently, convolutional neural networks have been introduced into 3D pose estimation, but these methods run slowly and are not suitable for driving scenarios. In this study, the proposed method is based on two types of input: an infrared image and a point cloud obtained from a time‐of‐flight camera. The authors propose a joint 2D–3D network incorporating image‐based and point‐based features to improve the performance of 3D human pose estimation while running at high speed. For point clouds with invalid points, the authors first preprocess the data and then design a denoising module to handle this problem. Experiments on a private driver data set and the public Invariant‐Top View data set show that the proposed method achieves efficient and competitive performance on 3D human pose estimation.

  • Book Chapter
  • Cite count: 506
  • 10.1007/978-3-030-01249-6_5
Exploiting Temporal Information for 3D Human Pose Estimation
  • Jan 1, 2018
  • Mir Rayat Imtiaz Hossain + 1 more

In this work, we address the problem of 3D human pose estimation from a sequence of 2D human poses. Although the recent success of deep networks has led many state-of-the-art methods for 3D pose estimation to train deep networks end-to-end to predict from images directly, the top-performing approaches have shown the effectiveness of dividing the task of 3D pose estimation into two steps: using a state-of-the-art 2D pose estimator to estimate the 2D pose from images and then mapping it into 3D space. They also showed that a low-dimensional representation, such as the 2D locations of a set of joints, can be discriminative enough to estimate 3D pose with high accuracy. However, estimating the 3D pose for individual frames leads to temporally incoherent estimates, because independent errors in each frame cause jitter. Therefore, in this work we utilize the temporal information across a sequence of 2D joint locations to estimate a sequence of 3D poses. We designed a sequence-to-sequence network composed of layer-normalized LSTM units with shortcut connections connecting the input to the output on the decoder side, and imposed a temporal smoothness constraint during training. We found that the knowledge of temporal consistency improves the best reported result on the Human3.6M dataset by approximately 12.2% and helps our network to recover temporally consistent 3D poses over a sequence of images even when the 2D pose detector fails.
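
The temporal smoothness constraint mentioned above can be pictured as a penalty on frame-to-frame joint motion. The sketch below, in plain Python, assumes the simplest form (the mean squared first-order difference between consecutive frames). The abstract does not give the exact formulation, so both the function name and this specific penalty are assumptions.

```python
def temporal_smoothness(seq):
    """Mean squared displacement of each joint between consecutive frames.

    seq: list of frames; each frame is a list of (x, y, z) joints.
    Returns 0.0 for sequences shorter than two frames.
    Assumed form of the penalty; the paper's exact loss may differ.
    """
    if len(seq) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, cur in zip(seq, seq[1:]):
        for p, q in zip(prev, cur):
            total += sum((a - b) ** 2 for a, b in zip(p, q))
            count += 1
    return total / count

# Usage: a single joint that moves 1 unit, then 2 units, across three frames.
jittery = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(1.0, 2.0, 0.0)]]
penalty = temporal_smoothness(jittery)
```

Adding a term like this to a per-frame pose loss discourages independent per-frame errors from showing up as jitter in the predicted sequence.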

  • Research Article
  • Cite count: 29
  • 10.1109/tmm.2022.3158068
Quantification of Occlusion Handling Capability of a 3D Human Pose Estimation Framework
  • Jan 1, 2023
  • IEEE Transactions on Multimedia
  • Mehwish Ghafoor + 1 more

3D human pose estimation using monocular images is an important yet challenging task. Existing 3D pose detection methods exhibit excellent performance under normal conditions; however, their performance may degrade due to occlusion. Recently, some occlusion-aware methods have also been proposed; however, the occlusion handling capability of these networks has not yet been thoroughly investigated. In the current work, we propose an occlusion-guided 3D human pose estimation framework and quantify its occlusion handling capability using different protocols. The proposed method estimates more accurate 3D human poses using 2D skeletons with missing joints as input. Missing joints are handled by introducing occlusion guidance that provides extra information about the absence or presence of a joint. Temporal information is also exploited to better estimate the missing joints. A large number of experiments are performed to quantify the occlusion handling capability of the proposed method on three publicly available datasets in various settings, including random missing joints, fixed body parts missing, and complete frames missing, using the mean per joint position error criterion. In addition, the quality of the predicted 3D poses is evaluated using action classification performance as a criterion. 3D poses estimated by the proposed method achieved significantly improved action recognition performance in the presence of missing joints. Our experiments demonstrate the effectiveness of the proposed framework for handling missing joints, as well as for quantifying the occlusion handling capability of deep neural networks.
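
Two ingredients of the abstract above are easy to show in code: the mean per joint position error (MPJPE) criterion and a presence/absence signal for each joint. The plain Python sketch below computes MPJPE as the average Euclidean distance between predicted and ground-truth joints. The second function is only one plausible encoding of "occlusion guidance" (zeroing a missing joint and appending a 0/1 flag); the abstract does not specify the encoding, so that function and its name are assumptions.

```python
import math

def mpjpe(pred, target):
    """Mean per joint position error: average Euclidean distance over joints."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

def with_occlusion_guidance(pose2d, visible):
    """Attach a presence flag to each 2D joint (hypothetical encoding).

    pose2d: list of (x, y) joints; visible: list of booleans.
    Missing joints are zeroed out and flagged with 0.0; present joints keep
    their coordinates and are flagged with 1.0.
    """
    return [(x, y, 1.0) if v else (0.0, 0.0, 0.0)
            for (x, y), v in zip(pose2d, visible)]

# Usage: one joint is off by 5 units, the other is exact.
error = mpjpe([(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
              [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)])
guided = with_occlusion_guidance([(1.0, 2.0), (3.0, 4.0)], [True, False])
```

The explicit flag lets a network distinguish a joint that is missing from one that genuinely sits at the origin.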

  • Research Article
  • Cite count: 41
  • 10.1016/j.cviu.2018.03.007
A dual-source approach for 3D human pose estimation from single images
  • Apr 4, 2018
  • Computer Vision and Image Understanding
  • Umar Iqbal + 5 more

  • Research Article
  • Cite count: 346
  • 10.1109/tpami.2019.2892985
LCR-Net++: Multi-Person 2D and 3D Pose Detection in Natural Images.
  • Jan 1, 2019
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Gregory Rogez + 2 more

We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images. Key to our approach is the generation and scoring of a number of pose proposals per image, which allows us to predict 2D and 3D poses of multiple people simultaneously. Hence, our approach does not require an approximate localization of the humans for initialization. Our Localization-Classification-Regression architecture, named LCR-Net, contains three main components: 1) a pose proposal generator that suggests candidate poses at different locations in the image; 2) a classifier that scores the different pose proposals; and 3) a regressor that refines pose proposals in both 2D and 3D. All three stages share the convolutional feature layers and are trained jointly. The final pose estimate is obtained by integrating over neighboring pose hypotheses, which is shown to improve over a standard non-maximum suppression algorithm. Our method recovers full-body 2D and 3D poses, hallucinating plausible body parts when the persons are partially occluded or truncated by the image boundary. Our approach significantly outperforms the state of the art in 3D pose estimation on Human3.6M, a controlled environment. Moreover, it shows promising results on real images for both single and multi-person subsets of the MPII 2D pose benchmark and demonstrates satisfying 3D pose results even for multi-person images.
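
The difference between integrating over neighboring hypotheses and plain non-maximum suppression can be sketched in a few lines of Python. The abstract does not say how neighbors are selected or weighted, so this example assumes the hypotheses passed in have already been grouped as neighbors and uses their classifier scores as weights. Both function names are hypothetical.

```python
def integrate_hypotheses(poses, scores):
    """Score-weighted average of pose hypotheses that share a joint layout.

    poses: list of poses, each a list of coordinate tuples (2D or 3D).
    scores: one non-negative score per pose (assumed weighting scheme).
    """
    total = sum(scores)
    if not poses or total <= 0:
        raise ValueError("need at least one hypothesis with positive total score")
    n_joints, dims = len(poses[0]), len(poses[0][0])
    return [
        tuple(sum(s * pose[j][d] for pose, s in zip(poses, scores)) / total
              for d in range(dims))
        for j in range(n_joints)
    ]

def pick_top_scoring(poses, scores):
    """NMS-style baseline: keep only the single best-scoring hypothesis."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return poses[best]

# Usage: two single-joint 2D hypotheses with scores 1 and 3.
hyps = [[(0.0, 0.0)], [(2.0, 4.0)]]
merged = integrate_hypotheses(hyps, [1.0, 3.0])
top = pick_top_scoring(hyps, [1.0, 3.0])
```

Averaging lets lower-scoring but nearby proposals contribute to the final estimate, whereas the baseline discards everything except the top proposal.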
