AbstractWe present a deep learning based technique that enables novel‐view videos of human performances to be synthesized from sparse multi‐view captures. While performance capturing from a sparse set of videos has received significant attention, there has been relatively less progress which is about non‐rigid objects (e.g., human bodies). The rich articulation modes of human body make it rather challenging to synthesize and interpolate the model well. To address this problem, we propose a novel deep learning based framework that directly predicts novel‐view videos of human performances without explicit 3D reconstruction. Our method is a composition of two steps: novel‐view prediction and detail enhancement. We first learn a novel deep generative query network for view prediction. We synthesize novel‐view performances from a sparse set of just five or less camera videos. Then, we use a new generative adversarial network to enhance fine‐scale details of the first step results. This opens up the possibility of high‐quality low‐cost video‐based performance synthesis, which is gaining popularity for VA and AR applications. We demonstrate a variety of promising results, where our method is able to synthesis more robust and accurate performances than existing state‐of‐the‐art approaches when only sparse views are available.
Read full abstract