Abstract

We propose a new method for single-camera real-world 3-D human pose estimation. Our method uses multitask training together with iterative pose refinement using a novel conditional attention mechanism. For iterative pose refinement, the output of each convolutional layer is conditioned on the latest pose estimate, using a conditioned squeeze-and-excitation network architecture that incorporates novel feedback connections. Multitask training on both an in-the-wild 2-D pose dataset and a controlled 3-D pose dataset allows for real-world 3-D pose estimation without the need for a large-scale in-the-wild 3-D pose dataset, which is unavailable. Experiments are performed on several real-world datasets, as well as the Human 3.6 Million and HumanEva-I datasets, to show that the combined attention mechanism, iterative refinement scheme, and multitask training allow us to achieve robust and competitive performance with only a simple network architecture. In addition, we show that our method is efficient enough to run on commodity hardware, producing pose estimates in real time.
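As a rough illustration of the conditioning idea described above, the following minimal sketch shows a squeeze-and-excitation gate that takes the latest pose estimate as an extra input, and a feedback loop that re-estimates the pose from the rescaled features. It is not the authors' implementation: the abstract does not say how the pose enters the gate, so concatenating it with the pooled channel statistics is an assumption, and the names `conditional_se`, `init_params`, `refine`, and the linear pose head `head_w` are hypothetical.

```python
import math
import random

def squeeze(features):
    """Global average pool: one scalar per channel (features is C x H x W)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]

def dense(x, weights, bias):
    """Fully connected layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def conditional_se(features, pose, params):
    """Rescale each channel by a gate computed from pooled features AND the pose.

    Assumption: the pose estimate is simply concatenated with the squeezed vector.
    """
    z = squeeze(features) + list(pose)
    hidden = [max(0.0, h) for h in dense(z, params["w1"], params["b1"])]  # ReLU
    gates = [1.0 / (1.0 + math.exp(-g))                                   # sigmoid
             for g in dense(hidden, params["w2"], params["b2"])]
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, features)]
    return scaled, gates

def init_params(channels, pose_dim, hidden_dim, seed=0):
    """Random weights for illustration only; a real model would learn these."""
    rng = random.Random(seed)
    def rand(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"w1": rand(hidden_dim, channels + pose_dim), "b1": [0.0] * hidden_dim,
            "w2": rand(channels, hidden_dim), "b2": [0.0] * channels}

def refine(features, params, head_w, steps=3):
    """Feedback loop: each pass gates the features with the previous pose estimate.

    head_w is a hypothetical linear pose head (pose_dim x channels).
    """
    pose = [0.0] * len(head_w)  # start from a neutral estimate
    for _ in range(steps):
        scaled, _ = conditional_se(features, pose, params)
        pose = dense(squeeze(scaled), head_w, [0.0] * len(head_w))
    return pose
```

In this sketch the gate only rescales channels, so the feature map keeps its shape and the block could in principle follow any convolutional layer; how many refinement steps are used, and what the pose head looks like, are left open here.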
