Abstract

Audio-driven talking face video generation has attracted much attention recently. However, few existing works address the learning of talking head movement, especially from the perspective of phonetics. Observing that real-world talking faces are often accompanied by natural head movement, in this paper we model the relation between the speech signal and talking head movement, which is a typical one-to-many mapping problem. To solve this problem, we propose a novel two-step mapping strategy: (1) in the first step, we train an encoder that predicts a head motion behavior pattern (modeled as a feature vector) from the head motion sequence of a short video of 10-15 seconds, and (2) in the second step, we train a decoder that predicts a unique head motion sequence from both the motion behavior pattern and the auditory features of an arbitrary speech signal. Based on the proposed mapping strategy, we build a deep neural network model that takes a speech signal of a source person and a short video of a target person as input, and outputs a synthesized high-fidelity talking face video with personalized head pose. Extensive experiments and a user study show that our method can generate high-quality personalized head movement in synthesized talking face videos, while achieving facial animation quality (e.g., lip synchronization and expression) comparable to state-of-the-art methods.
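To make the two-step mapping concrete, the following is a minimal PyTorch-style sketch of how an encoder could map a short head-motion sequence to a behavior-pattern vector, and how a decoder could then predict head motion from audio features conditioned on that pattern. All module names, tensor shapes, and hyperparameters (pose_dim, pattern_dim, the GRU backbones, etc.) are illustrative assumptions, not the paper's actual architecture.

# A minimal sketch of the two-step mapping, assuming PyTorch and hypothetical
# tensor shapes; the paper's actual network design may differ.
import torch
import torch.nn as nn

class MotionPatternEncoder(nn.Module):
    """Step 1: map a head-motion sequence (e.g., per-frame pose parameters
    from a 10-15 s clip) to a fixed-length behavior-pattern vector."""
    def __init__(self, pose_dim=6, hidden_dim=128, pattern_dim=64):
        super().__init__()
        self.rnn = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, pattern_dim)

    def forward(self, pose_seq):              # pose_seq: (B, T, pose_dim)
        _, h = self.rnn(pose_seq)             # h: (1, B, hidden_dim)
        return self.proj(h[-1])               # (B, pattern_dim)

class HeadMotionDecoder(nn.Module):
    """Step 2: predict a head-motion sequence from per-frame audio features,
    conditioned on the behavior-pattern vector."""
    def __init__(self, audio_dim=80, pattern_dim=64, hidden_dim=128, pose_dim=6):
        super().__init__()
        self.rnn = nn.GRU(audio_dim + pattern_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, audio_feats, pattern):  # audio_feats: (B, T, audio_dim)
        T = audio_feats.size(1)
        cond = pattern.unsqueeze(1).expand(-1, T, -1)   # broadcast pattern over time
        x = torch.cat([audio_feats, cond], dim=-1)
        y, _ = self.rnn(x)
        return self.out(y)                    # (B, T, pose_dim)

# Usage sketch: extract the pattern from the target person's short clip, then
# drive head motion with the source speech (e.g., mel-spectrogram frames).
encoder, decoder = MotionPatternEncoder(), HeadMotionDecoder()
pattern = encoder(torch.randn(1, 300, 6))            # ~10 s of poses at 30 fps
head_motion = decoder(torch.randn(1, 300, 80), pattern)

In this sketch, the one-to-many ambiguity is resolved by the pattern vector: the same speech input paired with different behavior patterns yields different, personalized head-motion sequences.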
