Abstract

Speech-driven facial animation is the process of automatically synthesizing talking characters from speech signals. Most work in this domain learns a mapping from audio features to visual features, an approach that often requires post-processing with computer graphics techniques to produce realistic, albeit subject-dependent, results. We present an end-to-end system that generates videos of a talking head from only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features. Our method generates videos which have (a) lip movements that are in sync with the audio and (b) natural facial expressions such as blinks and eyebrow movements. Our temporal GAN uses three discriminators focused on achieving detailed frames, audio-visual synchronization, and realistic expressions. We quantify the contribution of each component with an ablation study and provide insights into the model's latent representation. The generated videos are evaluated for sharpness, reconstruction quality, lip-reading accuracy, and synchronization, as well as their ability to produce natural blinks.
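
To make the architecture described above concrete, the sketch below shows one possible PyTorch arrangement of a generator conditioned on a still image and per-frame audio features, together with the three discriminators (frame, synchronization, and sequence). All module names, layer sizes, and tensor shapes (`Generator`, `SyncDiscriminator`, `T`, `A_DIM`, `IMG`, `Z`, etc.) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

T, A_DIM, IMG, Z = 16, 128, 64, 64   # frames per clip, audio-feature size, image side, noise size (all assumed)

class Generator(nn.Module):
    """Maps a still image plus an audio clip (one feature vector per frame) to T video frames."""
    def __init__(self):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * IMG * IMG, 256), nn.ReLU())
        self.aud_enc = nn.GRU(A_DIM, 256, batch_first=True)
        self.dec = nn.Sequential(nn.Linear(512 + Z, 3 * IMG * IMG), nn.Tanh())

    def forward(self, still, audio, noise):
        idf = self.img_enc(still)                           # (B, 256) identity code from the still image
        adf, _ = self.aud_enc(audio)                        # (B, T, 256) per-frame audio code
        h = torch.cat([idf.unsqueeze(1).expand(-1, T, -1), adf,
                       noise.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        return self.dec(h).view(-1, T, 3, IMG, IMG)

class FrameDiscriminator(nn.Module):
    """Judges single frames (conditioned on the still image) for sharpness and identity preservation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(2 * 3 * IMG * IMG, 256),
                                 nn.LeakyReLU(0.2), nn.Linear(256, 1))

    def forward(self, frame, still):
        return self.net(torch.cat([frame, still], dim=1))

class SyncDiscriminator(nn.Module):
    """Judges whether a clip of frames matches the accompanying audio."""
    def __init__(self):
        super().__init__()
        self.vid = nn.Sequential(nn.Flatten(), nn.Linear(T * 3 * IMG * IMG, 256), nn.ReLU())
        self.aud = nn.Sequential(nn.Flatten(), nn.Linear(T * A_DIM, 256), nn.ReLU())
        self.out = nn.Linear(512, 1)

    def forward(self, frames, audio):
        return self.out(torch.cat([self.vid(frames), self.aud(audio)], dim=-1))

class SequenceDiscriminator(nn.Module):
    """Judges whole sequences for natural motion such as blinks and eyebrow movements."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(3 * IMG * IMG, 256, batch_first=True)
        self.out = nn.Linear(256, 1)

    def forward(self, frames):
        h, _ = self.rnn(frames.flatten(start_dim=2))
        return self.out(h[:, -1])

# Smoke test with random tensors: only checks shapes, not training.
G = Generator()
still, audio, noise = torch.randn(2, 3, IMG, IMG), torch.randn(2, T, A_DIM), torch.randn(2, Z)
fake = G(still, audio, noise)                               # (2, T, 3, IMG, IMG)
print(FrameDiscriminator()(fake[:, 0], still).shape,        # torch.Size([2, 1])
      SyncDiscriminator()(fake, audio).shape,               # torch.Size([2, 1])
      SequenceDiscriminator()(fake).shape)                  # torch.Size([2, 1])
```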

Highlights

  • Computer Generated Imagery (CGI) has become an inextricable part of the entertainment industry due to its ability to produce high-quality results in a controllable manner

  • In order to drive down the cost and time required to produce high-quality CGI, researchers are looking into automatic face synthesis using machine learning techniques

  • Simons and Cox (1990) used vector quantization to achieve a compact representation of video and audio features, which were used as the states for their fully connected Markov model
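
The last highlight describes a classical pipeline built on vector quantization. As a loose illustration (not Simons and Cox's actual system), the sketch below quantizes per-frame feature vectors with a k-means codebook so that each frame is reduced to a discrete code, and then tallies the empirical transition counts such codes would feed into a fully connected Markov model. The feature dimensions, codebook size, and use of scikit-learn are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 20))          # stand-in for per-frame audio/visual feature vectors

K = 16                                          # codebook size (assumed)
codebook = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)
codes = codebook.predict(features)              # one discrete "state" per frame

# Empirical state-transition matrix: the kind of statistic a fully connected
# Markov model over these quantized states would be estimated from.
transitions = np.zeros((K, K))
for a, b in zip(codes[:-1], codes[1:]):
    transitions[a, b] += 1
transitions /= transitions.sum(axis=1, keepdims=True).clip(min=1)
```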


Summary

Introduction

Computer Generated Imagery (CGI) has become an inextricable part of the entertainment industry due to its ability to produce high-quality results in a controllable manner. The problem of generating realistic talking heads is multifaceted, requiring high-quality faces, lip movements synchronized with the audio, and plausible facial expressions. Subject-independent approaches have been proposed that transform audio features into video frames (Chung et al. 2017; Chen et al. 2018), but most of these methods restrict the problem to generating only the mouth. Although a temporal discriminator helps with the generation of expressions and provides a small improvement in audio-visual correspondence, it offers no way of ensuring that both of these aspects are captured in the video. To solve this problem we propose using two temporal discriminators to enforce audio-visual correspondence and realistic facial movements in the generated videos. We also present the results of an online Turing test, in which users are shown a series of generated and real videos and are asked to identify the real ones.
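
As a hedged illustration of how two temporal discriminators can jointly enforce audio-visual correspondence and realistic facial motion, the sketch below pairs a synchronization critic, which sees matched versus mismatched audio, with a sequence critic that scores whole clips. The function names, BCE formulation, loss weights, and reconstruction term are assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def temporal_gan_losses(d_sync, d_seq, real_vid, fake_vid, audio, audio_mismatched):
    """Losses for two temporal critics plus the generator terms that oppose them."""
    bce = F.binary_cross_entropy_with_logits
    ones, zeros = torch.ones_like, torch.zeros_like

    # Synchronization critic: real video with matched audio is the only positive;
    # generated video and real video paired with the wrong audio are negatives.
    p = d_sync(real_vid, audio)
    n_fake = d_sync(fake_vid.detach(), audio)
    n_miss = d_sync(real_vid, audio_mismatched)
    loss_d_sync = bce(p, ones(p)) + 0.5 * (bce(n_fake, zeros(n_fake)) + bce(n_miss, zeros(n_miss)))

    # Sequence critic: penalizes clips whose motion (blinks, eyebrow movement) looks unnatural.
    p = d_seq(real_vid)
    n = d_seq(fake_vid.detach())
    loss_d_seq = bce(p, ones(p)) + bce(n, zeros(n))

    # Generator: fool both critics while staying close to the ground-truth clip.
    g_sync, g_seq = d_sync(fake_vid, audio), d_seq(fake_vid)
    loss_g = (bce(g_sync, ones(g_sync)) + bce(g_seq, ones(g_seq))
              + F.l1_loss(fake_vid, real_vid))            # reconstruction term, weight assumed
    return loss_d_sync, loss_d_seq, loss_g
```

Giving the synchronization critic real video with deliberately mismatched audio provides negatives that differ only in timing, which is what pushes the generator toward lip movements that actually track the speech rather than merely looking plausible.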

  • Related Work
      • Visual Feature Selection and Blending
      • Synthesis Based on Hidden Markov Models
      • Synthesis Based on Deep Neural Networks
      • GAN-Based Video Synthesis
  • Speech-Driven Facial Synthesis
      • Generator
          • Content Encoder
          • Frame Decoder
      • Discriminators
          • Frame Discriminator
          • Synchronization Discriminator
      • Training
  • Datasets
  • Metrics
  • Experiments
      • Ablation Study
      • Method
      • Qualitative Results
      • Quantitative Results
  • Conclusion and Future Work