Abstract
Performance capture is used extensively within the creative industries to efficiently produce high-quality, realistic character animation for movies and video games. Existing commercial systems for performance capture are limited to working within constrained environments, requiring wearable visual markers or suits, and frequently specialised imaging devices (e.g. infra-red cameras), both of which limit deployment scenarios (e.g. to indoor capture). This thesis explores novel methods to relax these constraints, applying machine learning techniques to estimate human pose using regular video cameras and without requiring visible markers on the performer.
This unlocks the potential for co-production of principal footage and performance capture data, leading to production efficiencies. For example, using an array of static witness cameras deployed on-set, performance capture data for a video game character accompanying a major movie franchise might be captured at the same time the movie is shot. The need to call the actor back for a second day of shooting in a specialised motion capture (mo-cap) facility is avoided, saving time and money, since performance capture is possible without corrupting the principal movie footage with markers or constraining set design. Furthermore, if such performance capture data is available in real time, the director may immediately pre-visualize the look and feel of the final character animation, enabling tighter capture iteration and improved creative direction. This further enhances the potential for production efficiencies.
The core technical contributions of this thesis are novel software algorithms that leverage machine learning to fuse data from multiple sensors, namely synchronised video cameras and, in some cases, inertial measurement units (IMUs), in order to robustly estimate human body pose over time at real-time or near real-time rates. Firstly, a hardware-accelerated capture solution is developed for acquiring coarse volumetric occupancy data from multiple-viewpoint video footage, in the form of a probabilistic visual hull (PVH). Using CUDA-based GPU acceleration, the PVH may be estimated in real time and subsequently used to train machine learning algorithms to infer human skeletal pose from PVH data.
Initially, a variety of machine learning approaches for skeletal joint pose estimation are explored, contrasting classical and deep inference methods. By quantizing volumetric data into a two-dimensional (2D) spherical histogram representation, it is shown that convolutional neural network (CNN) architectures traditionally used for object recognition may be re-purposed for skeletal joint estimation, given a suitable training methodology and data augmentation strategy. The generalisation of such architectures to a fully volumetric (3D) CNN is explored, achieving state-of-the-art performance at human pose estimation using a volumetric auto-encoder (hour-glass) architecture that emulates networks traditionally used for de-noising and super-resolution (up-scaling) of 2D data. A framework is developed that is capable of simultaneously estimating human pose from volumetric data whilst also up-scaling that volumetric data to enable fine-grained estimation of surface detail, given a deeply learned prior from previous performances. The method is shown to generalise well even when that prior is learned across different subjects performing different movements, even in different studio camera configurations.
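As a rough, illustrative sketch of the kind of volumetric occupancy computation a PVH entails (not the thesis' CUDA implementation), the Python/NumPy fragment below fuses per-camera soft foreground masks over a voxel grid. The function name, the pinhole projection model and the product-based fusion rule are assumptions made purely for illustration.

```python
import numpy as np

def probabilistic_visual_hull(fg_prob_maps, projections, grid_shape, voxel_size, origin):
    """Illustrative PVH fusion: combine per-camera foreground probabilities over a voxel grid.

    fg_prob_maps : list of HxW arrays in [0, 1] (soft foreground masks, one per camera)
    projections  : list of 3x4 world-to-pixel projection matrices (homogeneous)
    grid_shape   : (nx, ny, nz) voxel counts
    voxel_size   : voxel edge length in world units
    origin       : world-space coordinate of the grid's minimum corner
    """
    # World-space centres of every voxel, as homogeneous coordinates (N x 4).
    axes = [np.arange(n) * voxel_size + o + 0.5 * voxel_size
            for n, o in zip(grid_shape, origin)]
    X, Y, Z = np.meshgrid(*axes, indexing="ij")
    centres = np.stack([X, Y, Z, np.ones_like(X)], axis=-1).reshape(-1, 4)

    occupancy = np.ones(len(centres))
    for prob_map, P in zip(fg_prob_maps, projections):
        pix = centres @ P.T                      # N x 3 homogeneous pixel coordinates
        z = pix[:, 2]
        in_front = z > 0
        u = np.zeros(len(centres), dtype=int)
        v = np.zeros(len(centres), dtype=int)
        u[in_front] = np.round(pix[in_front, 0] / z[in_front]).astype(int)
        v[in_front] = np.round(pix[in_front, 1] / z[in_front]).astype(int)
        h, w = prob_map.shape
        valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        p = np.zeros(len(centres))
        p[valid] = prob_map[v[valid], u[valid]]
        # Product over views: a voxel retains high occupancy only if every camera
        # that sees it reports foreground at its projected pixel.
        occupancy *= p
    return occupancy.reshape(grid_shape)
```

Because every voxel is processed independently, this style of computation parallelises naturally, which is consistent with the real-time, GPU-accelerated PVH estimation described above.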
Performance can be further improved using a learned temporal model of the data, and through the fusion of complementary sensor modalities (video and IMUs) to enhance the accuracy of human pose estimation inferred from a volumetric CNN. Although IMUs have been applied in the performance capture domain for many years, they are prone to drift, limiting their use to short capture sequences. The novel fusion of IMU and video data enables improved global localization and thus reduced error over time, whilst simultaneously mitigating the limb inter-occlusion issues that can frustrate video-only approaches.
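The thesis performs this fusion within the learned volumetric model itself; purely to illustrate why the two modalities are complementary, the hypothetical Python sketch below blends per-joint estimates from a drift-free but occlusion-prone video branch and a smooth but drift-prone IMU branch using a confidence weight. All names and the weighting scheme are illustrative assumptions, not the method of the thesis.

```python
import numpy as np

def fuse_joint_positions(video_joints, imu_joints, video_confidence, trust_video=0.9):
    """Hypothetical per-joint blend of two skeleton estimates (each a J x 3 array).

    video_joints     : joints regressed from the volumetric (PVH) CNN; globally
                       anchored (no drift) but unreliable when a limb is occluded
    imu_joints       : joints from forward kinematics over IMU orientations; smooth
                       and occlusion-free but subject to positional drift over time
    video_confidence : length-J visibility scores in [0, 1] for the video estimate
    """
    w = trust_video * np.clip(np.asarray(video_confidence), 0.0, 1.0)[:, None]  # J x 1 weights
    return w * np.asarray(video_joints) + (1.0 - w) * np.asarray(imu_joints)
```

In the thesis this trade-off is learned rather than hand-tuned, but the sketch captures the intuition: video supplies the drift-free global anchor, while the IMUs keep limbs tracked through inter-occlusion.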