Abstract

Speech-driven facial animation is the process of automatically synthesizing talking characters from speech signals. Most work in this domain learns a mapping from audio features to visual features, an approach that often requires post-processing with computer graphics techniques to produce realistic, albeit subject-dependent, results. We present an end-to-end system that generates videos of a talking head from only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features. Our method generates videos which have (a) lip movements that are in sync with the audio and (b) natural facial expressions such as blinks and eyebrow movements. Our temporal GAN uses three discriminators focused on achieving detailed frames, audio-visual synchronization, and realistic expressions. We quantify the contribution of each component with an ablation study and provide insights into the model's latent representation. The generated videos are evaluated for sharpness, reconstruction quality, lip-reading accuracy, and synchronization, as well as their ability to produce natural blinks.
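
To make the architecture described above concrete, the sketch below shows one possible PyTorch arrangement of a generator conditioned on a still image and per-frame audio features, together with the three discriminators (frame, synchronization, and sequence). All module names, layer sizes, and tensor shapes (`Generator`, `SyncDiscriminator`, `T`, `A_DIM`, `IMG`, `Z`, etc.) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

T, A_DIM, IMG, Z = 16, 128, 64, 64   # frames per clip, audio-feature size, image side, noise size (all assumed)

class Generator(nn.Module):
    """Maps a still image plus an audio clip (one feature vector per frame) to T video frames."""
    def __init__(self):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * IMG * IMG, 256), nn.ReLU())
        self.aud_enc = nn.GRU(A_DIM, 256, batch_first=True)
        self.dec = nn.Sequential(nn.Linear(512 + Z, 3 * IMG * IMG), nn.Tanh())

    def forward(self, still, audio, noise):
        idf = self.img_enc(still)                           # (B, 256) identity code from the still image
        adf, _ = self.aud_enc(audio)                        # (B, T, 256) per-frame audio code
        h = torch.cat([idf.unsqueeze(1).expand(-1, T, -1), adf,
                       noise.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        return self.dec(h).view(-1, T, 3, IMG, IMG)

class FrameDiscriminator(nn.Module):
    """Judges single frames (conditioned on the still image) for sharpness and identity preservation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(2 * 3 * IMG * IMG, 256),
                                 nn.LeakyReLU(0.2), nn.Linear(256, 1))

    def forward(self, frame, still):
        return self.net(torch.cat([frame, still], dim=1))

class SyncDiscriminator(nn.Module):
    """Judges whether a clip of frames matches the accompanying audio."""
    def __init__(self):
        super().__init__()
        self.vid = nn.Sequential(nn.Flatten(), nn.Linear(T * 3 * IMG * IMG, 256), nn.ReLU())
        self.aud = nn.Sequential(nn.Flatten(), nn.Linear(T * A_DIM, 256), nn.ReLU())
        self.out = nn.Linear(512, 1)

    def forward(self, frames, audio):
        return self.out(torch.cat([self.vid(frames), self.aud(audio)], dim=-1))

class SequenceDiscriminator(nn.Module):
    """Judges whole sequences for natural motion such as blinks and eyebrow movements."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(3 * IMG * IMG, 256, batch_first=True)
        self.out = nn.Linear(256, 1)

    def forward(self, frames):
        h, _ = self.rnn(frames.flatten(start_dim=2))
        return self.out(h[:, -1])

# Smoke test with random tensors: only checks shapes, not training.
G = Generator()
still, audio, noise = torch.randn(2, 3, IMG, IMG), torch.randn(2, T, A_DIM), torch.randn(2, Z)
fake = G(still, audio, noise)                               # (2, T, 3, IMG, IMG)
print(FrameDiscriminator()(fake[:, 0], still).shape,        # torch.Size([2, 1])
      SyncDiscriminator()(fake, audio).shape,               # torch.Size([2, 1])
      SequenceDiscriminator()(fake).shape)                  # torch.Size([2, 1])
```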

Highlights

  • Computer Generated Imagery (CGI) has become an inextricable part of the entertainment industry due to its ability to produce high-quality results in a controllable manner

  • In order to drive down the cost and time required to produce high-quality CGI, researchers are looking into automatic face synthesis using machine learning techniques

  • Simons and Cox (1990) used vector quantization to achieve a compact representation of video and audio features, which were used as the states for their fully connected Markov model
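
The last highlight describes a classical pipeline built on vector quantization. As a loose illustration (not Simons and Cox's actual system), the sketch below quantizes per-frame feature vectors with a k-means codebook so that each frame is reduced to a discrete code, and then tallies the empirical transition counts such codes would feed into a fully connected Markov model. The feature dimensions, codebook size, and use of scikit-learn are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 20))          # stand-in for per-frame audio/visual feature vectors

K = 16                                          # codebook size (assumed)
codebook = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)
codes = codebook.predict(features)              # one discrete "state" per frame

# Empirical state-transition matrix: the kind of statistic a fully connected
# Markov model over these quantized states would be estimated from.
transitions = np.zeros((K, K))
for a, b in zip(codes[:-1], codes[1:]):
    transitions[a, b] += 1
transitions /= transitions.sum(axis=1, keepdims=True).clip(min=1)
```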


Summary

Introduction

Computer Generated Imagery (CGI) has become an inextricable part of the entertainment industry due to its ability to produce high-quality results in a controllable manner. The problem of generating realistic talking heads is multifaceted, requiring high-quality faces, lip movements synchronized with the audio, and plausible facial expressions. Subject-independent approaches have been proposed that transform audio features into video frames (Chung et al. 2017; Chen et al. 2018), but most of these methods restrict the problem to generating only the mouth. Although a temporal discriminator helps with the generation of expressions and provides a small improvement in audio-visual correspondence, it offers no way of ensuring that both of these aspects are captured in the video. To solve this problem we propose using two temporal discriminators to enforce audio-visual correspondence and realistic facial movements in the generated videos. We also present the results of an online Turing test, in which users are shown a series of generated and real videos and are asked to identify the real ones.
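
As a hedged illustration of how two temporal discriminators can jointly enforce audio-visual correspondence and realistic facial motion, the sketch below pairs a synchronization critic, which sees matched versus mismatched audio, with a sequence critic that scores whole clips. The function names, BCE formulation, loss weights, and reconstruction term are assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def temporal_gan_losses(d_sync, d_seq, real_vid, fake_vid, audio, audio_mismatched):
    """Losses for two temporal critics plus the generator terms that oppose them."""
    bce = F.binary_cross_entropy_with_logits
    ones, zeros = torch.ones_like, torch.zeros_like

    # Synchronization critic: real video with matched audio is the only positive;
    # generated video and real video paired with the wrong audio are negatives.
    p = d_sync(real_vid, audio)
    n_fake = d_sync(fake_vid.detach(), audio)
    n_miss = d_sync(real_vid, audio_mismatched)
    loss_d_sync = bce(p, ones(p)) + 0.5 * (bce(n_fake, zeros(n_fake)) + bce(n_miss, zeros(n_miss)))

    # Sequence critic: penalizes clips whose motion (blinks, eyebrow movement) looks unnatural.
    p = d_seq(real_vid)
    n = d_seq(fake_vid.detach())
    loss_d_seq = bce(p, ones(p)) + bce(n, zeros(n))

    # Generator: fool both critics while staying close to the ground-truth clip.
    g_sync, g_seq = d_sync(fake_vid, audio), d_seq(fake_vid)
    loss_g = (bce(g_sync, ones(g_sync)) + bce(g_seq, ones(g_seq))
              + F.l1_loss(fake_vid, real_vid))            # reconstruction term, weight assumed
    return loss_d_sync, loss_d_seq, loss_g
```

Giving the synchronization critic real video with deliberately mismatched audio provides negatives that differ only in timing, which is what pushes the generator toward lip movements that actually track the speech rather than merely looking plausible.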

  • Related Work
      • Visual Feature Selection and Blending
      • Synthesis Based on Hidden Markov Models
      • Synthesis Based on Deep Neural Networks
      • GAN-Based Video Synthesis
  • Speech-Driven Facial Synthesis
      • Generator
          • Content Encoder
          • Frame Decoder
      • Discriminators
          • Frame Discriminator
          • Synchronization Discriminator
      • Training
  • Datasets
  • Metrics
  • Experiments
      • Ablation Study
      • Method
      • Qualitative Results
      • Quantitative Results
  • Conclusion and Future Work