Abstract

We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely, we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull. The learnt pose stream is processed concurrently with a forward kinematic solve of the IMU data, and a temporal model (LSTM) exploits the rich spatial and temporal long-range dependencies among the solved joints. The two streams are then fused in a final fully connected layer. The two complementary data sources allow ambiguities within each sensor modality to be resolved, yielding improved accuracy over prior methods. Extensive evaluation is performed, with state-of-the-art performance reported on the popular Human 3.6M dataset (Ionescu et al. in IEEE Trans Pattern Anal Mach Intell 36(7):1325–1339, 2014), the newly released TotalCapture dataset and a challenging set of outdoor videos, TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture), comprising multi-viewpoint video, IMU data and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.
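
As an illustration of the "forward kinematic solve of the IMU data" mentioned above, the following is a minimal sketch only, not the authors' implementation. It assumes each IMU provides a global orientation for the bone it is attached to (as a 3x3 rotation matrix) and that the rest-pose bone offsets are known. The function names, the three-joint skeleton and the offsets are all hypothetical and chosen for illustration.

```python
import math


def rot_z(deg):
    """3x3 rotation matrix for a rotation of `deg` degrees about the z axis."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]


def forward_kinematics(parents, offsets, global_rots, root_pos=(0.0, 0.0, 0.0)):
    """Compute joint positions from per-bone global orientations.

    parents[i]     : index of joint i's parent (-1 for the root)
    offsets[i]     : rest-pose vector from the parent joint to joint i,
                     expressed in the parent bone's local frame
    global_rots[i] : global orientation of the bone that starts at joint i
                     (assumed here to come directly from an IMU)
    Joints must be ordered so that every parent precedes its children.
    """
    positions = []
    for i, p in enumerate(parents):
        if p < 0:
            positions.append(list(root_pos))
        else:
            # Rotate the rest-pose offset by the parent bone's orientation,
            # then translate by the parent joint's position.
            d = mat_vec(global_rots[p], offsets[i])
            positions.append([positions[p][k] + d[k] for k in range(3)])
    return positions


# Hypothetical two-bone arm: shoulder (root) -> elbow -> wrist, unit-length bones.
parents = [-1, 0, 1]
offsets = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
# Upper arm along x, forearm rotated 90 degrees about z; the leaf rotation is unused.
rots = [rot_z(0), rot_z(90), rot_z(0)]
joints = forward_kinematics(parents, offsets, rots)
# joints[1] is approximately [1, 0, 0] and joints[2] approximately [1, 1, 0].
```

Because each position depends only on the parent's position and orientation, a single pass over joints in parent-first order suffices. In the approach described above, joint positions solved from the IMU data are the kind of input that the temporal model would then process.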

Highlights

  • Marker-less, real-time 3D human pose estimation is challenging and is attracting increasing research interest, as it will deliver step changes to a wide range of fields, from biomechanics and psychology to animation, human-computer interaction and computer vision

  • We introduce our new dataset called TotalCapture (Gilbert et al 2017) in Sect. 4.2, which contains both video and inertial measurement unit (IMU) data with the associated ground-truth (GT) joint skeleton

  • We evaluate our full fused vision and IMU approach on the TotalCapture dataset and perform an ablation study in Sect. 4.4 to examine the individual contributions of our work


Summary

Introduction

Marker-less real-time 3D human pose estimation is attracting increasing research interest, as it will deliver step changes to a wide range of fields, from biomechanics and psychology to animation, human-computer interaction and computer vision. 3D pose estimation faces many challenges, including large variation in appearance, arbitrary viewpoints and obstructed visibility due to external entities and self-occlusions. To resolve these challenges effectively, marker-based systems such as Vicon (http://www.vicon.com) or OptiTrack (http://www.optitrack.com) are commonly used to provide sufficient joint accuracy. Approaches have tried to remove these constraints through the use of elaborate prior terms and body modelling (von Marcard et al 2017), with the use of depth cameras (Yub et al 2016), or by extending 2D estimation to 3D (Tome et al 2017; Tan et al 2017).
