Abstract

We present a method for photo-realistic face manipulation. Given a single RGB face image with an arbitrary expression, our method can synthesize the same person with any other expression. To achieve this, we first fit a 3D face model and disentangle the face into its texture and shape. We then train separate networks in each of these spaces. In texture space, we use a conditional generative network to change the appearance, carefully designing the input format and loss functions to achieve the best results. In shape space, we use a fully connected network to predict an accurate face shape. When available, the shape branch uses depth data for supervision. Both networks are conditioned on expression coefficients rather than discrete labels, allowing us to generate an unlimited number of expressions. Furthermore, we adopt spatially adaptive denormalization on our texture-space representation to improve the quality of the synthesized results. We demonstrate the superiority of this disentangled approach through both quantitative and qualitative studies. The proposed method does not require paired data, and is trained on an in-the-wild dataset of videos of people talking. To make such data usable, we present a simple yet efficient method for selecting appropriate key frames from these videos. In a user study, our method is preferred in 83.2% of cases over state-of-the-art alternatives.
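As a rough illustration of the two-branch design the abstract describes, the sketch below (PyTorch) shows a spatially adaptive denormalization block for the texture branch and a fully connected shape network conditioned on expression coefficients. This is not the authors' implementation: all layer sizes, the vertex count, and the class names are illustrative assumptions.

```python
# Minimal sketch of the disentangled two-branch architecture described in
# the abstract. Layer widths, channel counts, and the vertex count are
# illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SPADE(nn.Module):
    """Spatially adaptive denormalization (Park et al., 2019) for the
    texture branch: parameter-free normalization whose per-pixel scale
    and shift are predicted from a conditioning map."""

    def __init__(self, feat_channels, cond_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, x, cond):
        # Resize the conditioning map to the feature resolution, then
        # modulate the normalized features with predicted scale/shift.
        cond = F.interpolate(cond, size=x.shape[2:], mode='nearest')
        h = self.shared(cond)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)


class ShapeNet(nn.Module):
    """Shape branch: a fully connected network mapping continuous
    expression coefficients to per-vertex 3D offsets."""

    def __init__(self, n_expr=64, n_verts=5023):  # assumed sizes
        super().__init__()
        self.n_verts = n_verts
        self.mlp = nn.Sequential(
            nn.Linear(n_expr, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_verts * 3))

    def forward(self, expr):
        # expr: (batch, n_expr) -> (batch, n_verts, 3)
        return self.mlp(expr).view(-1, self.n_verts, 3)
```

Because both branches take continuous expression coefficients rather than discrete labels, the same networks can, in principle, be evaluated at any point in coefficient space, which is what allows the method to generate an unlimited number of expressions.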
