Abstract

We present a method for photo-realistic face manipulation. Given a single RGB face image with an arbitrary expression, our method can synthesize the same person with any other expression. To achieve this, we first fit a 3D face model and disentangle the face into its texture and shape. We then train separate networks in each of these spaces. In texture space, we use a conditional generative network to change the appearance, carefully designing the input format and loss functions to achieve the best results. In shape space, we use a fully connected network to predict an accurate face shape. When available, the shape branch uses depth data for supervision. Both networks are conditioned on expression coefficients rather than discrete labels, allowing us to generate an unlimited number of expressions. Furthermore, we adopt spatially adaptive denormalization on our texture-space representation to improve the quality of the synthesized results. We demonstrate the superiority of this disentangled approach through both quantitative and qualitative studies. The proposed method does not require paired data, and is trained on an in-the-wild dataset of videos of people talking. To make such data usable, we present a simple yet efficient method for selecting appropriate key frames from these videos. In a user study, our method is preferred in 83.2% of cases over state-of-the-art alternatives.
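As a rough illustration of the two-branch design the abstract describes, the sketch below (PyTorch) shows a spatially adaptive denormalization block for the texture branch and a fully connected shape network conditioned on expression coefficients. This is not the authors' implementation: all layer sizes, the vertex count, and the class names are illustrative assumptions.

```python
# Minimal sketch of the disentangled two-branch architecture described in
# the abstract. Layer widths, channel counts, and the vertex count are
# illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SPADE(nn.Module):
    """Spatially adaptive denormalization (Park et al., 2019) for the
    texture branch: parameter-free normalization whose per-pixel scale
    and shift are predicted from a conditioning map."""

    def __init__(self, feat_channels, cond_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, x, cond):
        # Resize the conditioning map to the feature resolution, then
        # modulate the normalized features with predicted scale/shift.
        cond = F.interpolate(cond, size=x.shape[2:], mode='nearest')
        h = self.shared(cond)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)


class ShapeNet(nn.Module):
    """Shape branch: a fully connected network mapping continuous
    expression coefficients to per-vertex 3D offsets."""

    def __init__(self, n_expr=64, n_verts=5023):  # assumed sizes
        super().__init__()
        self.n_verts = n_verts
        self.mlp = nn.Sequential(
            nn.Linear(n_expr, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_verts * 3))

    def forward(self, expr):
        # expr: (batch, n_expr) -> (batch, n_verts, 3)
        return self.mlp(expr).view(-1, self.n_verts, 3)
```

Because both branches take continuous expression coefficients rather than discrete labels, the same networks can, in principle, be evaluated at any point in coefficient space, which is what allows the method to generate an unlimited number of expressions.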
