Speech driven facial animation generation based on GAN

Xiong Li, Jiye Zhang, Yazhi Liu

doi:10.1016/j.displa.2022.102260

Abstract

A facial animation generation model produces talking-face videos from speech audio clips and face images, and the frame sequence of the generated animation should be well synchronized with the source audio. This paper proposes FAAN, a Facial Animation generation model based on an Adversarial Network, which generates audio-synchronized facial animation from real human speech and a face image. During encoding, the model maps the features of the face image and of the natural speech into a shared (public) space, and then generates the frame sequence of a talking face according to the temporal coherence features contained in the speech fragments. To improve the synchronization accuracy between the generated video frames and the source audio sequence, the model is designed as a conditional least-squares GAN in which the temporal sequence of features extracted from the speech audio clips is fed to the sequence discriminator as its condition. Furthermore, this temporal sequence of audio features is added to the shared space after the audio features are coupled with the facial features. A second conditional GAN, in which the input video frame is used as the condition of the frame discriminator, is included to improve the authenticity of the generated video frames. The model is evaluated using Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Fréchet Inception Distance (FID). The experimental results show that on the GRID dataset the PSNR and SSIM scores of the FAAN model are better than those of the latest model, and that on the LRW dataset the FAAN model has the highest PSNR score. Finally, the validity of the proposed model is demonstrated with optical flow diagrams of the generated video frames, which show that the model drives the mouth animation in fine detail. Source code and videos are available at https://github.com/zjy-2020/Speech2video-FAAN/
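To make the least-squares GAN objective mentioned in the abstract concrete, the following is a minimal sketch of the standard least-squares losses computed on discriminator scores. It is not taken from the FAAN source code: the function names, the 0/1 target labels, and the 0.5 weighting are assumptions based on the commonly used least-squares GAN formulation, and the scores are assumed to come from a discriminator that has already received its condition (audio features for the sequence discriminator, or the input frame for the frame discriminator).

```python
from typing import Sequence


def _mean_squared_error(scores: Sequence[float], target: float) -> float:
    """Mean of (score - target)^2 over a batch of discriminator scores."""
    return sum((s - target) ** 2 for s in scores) / len(scores)


def lsgan_discriminator_loss(
    real_scores: Sequence[float],
    fake_scores: Sequence[float],
    real_label: float = 1.0,  # assumed target for real samples
    fake_label: float = 0.0,  # assumed target for generated samples
) -> float:
    """Least-squares discriminator loss.

    real_scores: D(real sample, condition) for each item in the batch.
    fake_scores: D(generated sample, condition) for each item in the batch.
    """
    return 0.5 * _mean_squared_error(real_scores, real_label) + \
        0.5 * _mean_squared_error(fake_scores, fake_label)


def lsgan_generator_loss(
    fake_scores: Sequence[float],
    real_label: float = 1.0,  # generator wants fakes scored as real
) -> float:
    """Least-squares generator loss on the scores of generated samples."""
    return 0.5 * _mean_squared_error(fake_scores, real_label)


if __name__ == "__main__":
    # Hypothetical scores for a batch of two real and two generated clips.
    d_loss = lsgan_discriminator_loss([0.9, 1.0], [0.1, 0.3])
    g_loss = lsgan_generator_loss([0.1, 0.3])
    print(f"discriminator loss: {d_loss:.4f}, generator loss: {g_loss:.4f}")
```

In a setup like the one described, the same pair of losses would presumably be applied once per discriminator, with only the conditioning input differing between the sequence and frame discriminators.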

Full Text