Abstract

A talking-face video can be generated from a single facial image driven by speech audio. However, in state-of-the-art methods the generated video frames are not well synchronized with the input speech audio. To address this problem, this paper proposes a generative adversarial network (GAN)-based talking face generation method driven by time–frequency features extracted from the input speech audio (TF2). We design an audio time-series encoder that extracts joint time–frequency features from the input speech audio. A multi-level wavelet transform decomposes the speech signal from the time domain into multiple frequency sub-bands, and a gated recurrent unit (GRU) then extracts tempo-semantic correlation features from this multi-frequency representation, improving the realism of the temporal ordering of the generated video frames. A smoothed formulation of dynamic time warping is introduced into our time-series discriminator to measure the similarity between the frame order of the generated and sample videos. Experiments on the LRW, VoxCeleb2, and GRID datasets show improvements in PSNR, SSIM, and Sync_conf of 0.4∼3.75 dB, 0.01∼0.04, and 0.4∼2.7, respectively, and a reduction in LMD of 0.15∼3.99 compared with existing methods. The experimental results confirm that the proposed method can generate talking faces that are accurately synchronized with the driving audio.
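As a concrete illustration of the audio time-series encoder described above, the sketch below pairs a multi-level discrete wavelet decomposition with one GRU per frequency sub-band to produce a joint time–frequency feature from a raw speech waveform. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name AudioTFEncoder, the db4 wavelet, the number of decomposition levels, and the frame and hidden sizes are all illustrative choices not taken from the paper.

```python
# Minimal sketch (not the paper's released code) of a time-frequency audio encoder:
# multi-level wavelet decomposition followed by a GRU over each sub-band.
import numpy as np
import pywt
import torch
import torch.nn as nn


class AudioTFEncoder(nn.Module):
    """Encode a raw speech waveform into a joint time-frequency feature."""

    def __init__(self, wavelet: str = "db4", levels: int = 3,
                 frame_len: int = 64, hidden: int = 128):
        super().__init__()
        self.wavelet = wavelet      # assumed wavelet family
        self.levels = levels        # assumed number of decomposition levels
        self.frame_len = frame_len
        # One GRU per decomposition band (coarse approximation + detail levels).
        self.grus = nn.ModuleList(
            nn.GRU(input_size=frame_len, hidden_size=hidden, batch_first=True)
            for _ in range(levels + 1)
        )

    def forward(self, waveform: np.ndarray) -> torch.Tensor:
        # Multi-level discrete wavelet transform: time domain -> frequency sub-bands.
        coeffs = pywt.wavedec(waveform, self.wavelet, level=self.levels)
        band_feats = []
        for band, gru in zip(coeffs, self.grus):
            # Split each sub-band into fixed-length frames so the GRU sees a
            # sequence and can model the temporal order within that band.
            n = (len(band) // self.frame_len) * self.frame_len
            frames = torch.tensor(band[:n], dtype=torch.float32).view(1, -1, self.frame_len)
            _, h = gru(frames)               # final hidden state summarizes the band
            band_feats.append(h.squeeze(0))
        # Concatenate per-band summaries into one joint time-frequency feature.
        return torch.cat(band_feats, dim=-1)


if __name__ == "__main__":
    audio = np.random.randn(16000).astype(np.float32)   # 1 s of dummy 16 kHz speech
    feat = AudioTFEncoder()(audio)
    print(feat.shape)   # torch.Size([1, 512]) for 4 bands x 128 hidden units
```

Summarizing each sub-band with its own GRU keeps the per-band temporal structure explicit before the band features are fused, which is one plausible way to obtain the joint time–frequency representation the abstract refers to.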
