Abstract

Given an arbitrary speech clip or text as input, the proposed work aims to generate a talking face video with accurate lip synchronization. Existing works mainly have three limitations. (1) Single-modal learning is adopted with either audio or text as input, so the complementarity of multimodal inputs is lost. (2) Each frame is generated independently, ignoring the temporal dependency between consecutive frames. (3) Each face image is generated by a traditional convolutional neural network (CNN) with a local receptive field, which cannot effectively capture the spatial dependency within the internal representations of face images. To overcome these problems, we decompose the talking face generation task into two steps: mouth landmark prediction and video synthesis. First, a multimodal learning method is proposed to generate accurate mouth landmarks from multimedia inputs (both text and audio). Second, a network named Face2Vid is proposed to generate video frames conditioned on the predicted mouth landmarks. In Face2Vid, optical flow is employed to model the temporal dependency between frames, while a self-attention mechanism is introduced to model the spatial dependency across image regions. Extensive experiments demonstrate that our approach generates photo-realistic video frames with background, and shows superior performance in accurate lip-movement synchronization and smooth transitions of facial movements.
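To illustrate the spatial-dependency idea mentioned above, the following is a minimal sketch (in PyTorch) of a SAGAN-style self-attention block applied to image feature maps, which is one common way to let a generator attend across image regions. The abstract does not specify Face2Vid's actual layers; the module name SelfAttention2d, the 1/8 channel reduction, and the tensor sizes in the example are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a SAGAN-style self-attention block over 2D feature maps.
# Assumption: channel reduction factor of 8 for query/key projections.
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        q = self.query(x).view(b, -1, n).permute(0, 2, 1)  # (b, n, c/8)
        k = self.key(x).view(b, -1, n)                      # (b, c/8, n)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)       # (b, n, n) attention map
        v = self.value(x).view(b, c, n)                     # (b, c, n)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x  # starts as identity, learns to attend

# Example: attend over a hypothetical 64-channel feature map from a frame generator.
if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)
    print(SelfAttention2d(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```

The zero-initialized gamma lets the block behave as an identity mapping early in training, so global attention is introduced only gradually alongside the local convolutional features.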
