Abstract

Audio-driven talking face video generation has attracted much attention recently. However, few existing works address the learning of talking head movement, especially from the perspective of phonetics. Observing that real-world talking faces are often accompanied by natural head movement, in this paper we model the relation between the speech signal and talking head movement, which is a typical one-to-many mapping problem. To solve this problem, we propose a novel two-step mapping strategy: (1) in the first step, we train an encoder that predicts a head motion behavior pattern (modeled as a feature vector) from the head motion sequence of a short video of 10-15 seconds, and (2) in the second step, we train a decoder that predicts a unique head motion sequence from both the motion behavior pattern and the auditory features of an arbitrary speech signal. Based on the proposed mapping strategy, we build a deep neural network model that takes a speech signal of a source person and a short video of a target person as input, and outputs a synthesized high-fidelity talking face video with personalized head pose. Extensive experiments and a user study show that our method can generate high-quality personalized head movement in synthesized talking face videos, while achieving facial animation quality (e.g., lip synchronization and expression) comparable to state-of-the-art methods.
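To make the two-step mapping concrete, the following is a minimal PyTorch-style sketch of how an encoder could map a short head-motion sequence to a behavior-pattern vector, and how a decoder could then predict head motion from audio features conditioned on that pattern. All module names, tensor shapes, and hyperparameters (pose_dim, pattern_dim, the GRU backbones, etc.) are illustrative assumptions, not the paper's actual architecture.

# A minimal sketch of the two-step mapping, assuming PyTorch and hypothetical
# tensor shapes; the paper's actual network design may differ.
import torch
import torch.nn as nn

class MotionPatternEncoder(nn.Module):
    """Step 1: map a head-motion sequence (e.g., per-frame pose parameters
    from a 10-15 s clip) to a fixed-length behavior-pattern vector."""
    def __init__(self, pose_dim=6, hidden_dim=128, pattern_dim=64):
        super().__init__()
        self.rnn = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, pattern_dim)

    def forward(self, pose_seq):              # pose_seq: (B, T, pose_dim)
        _, h = self.rnn(pose_seq)             # h: (1, B, hidden_dim)
        return self.proj(h[-1])               # (B, pattern_dim)

class HeadMotionDecoder(nn.Module):
    """Step 2: predict a head-motion sequence from per-frame audio features,
    conditioned on the behavior-pattern vector."""
    def __init__(self, audio_dim=80, pattern_dim=64, hidden_dim=128, pose_dim=6):
        super().__init__()
        self.rnn = nn.GRU(audio_dim + pattern_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, audio_feats, pattern):  # audio_feats: (B, T, audio_dim)
        T = audio_feats.size(1)
        cond = pattern.unsqueeze(1).expand(-1, T, -1)   # broadcast pattern over time
        x = torch.cat([audio_feats, cond], dim=-1)
        y, _ = self.rnn(x)
        return self.out(y)                    # (B, T, pose_dim)

# Usage sketch: extract the pattern from the target person's short clip, then
# drive head motion with the source speech (e.g., mel-spectrogram frames).
encoder, decoder = MotionPatternEncoder(), HeadMotionDecoder()
pattern = encoder(torch.randn(1, 300, 6))            # ~10 s of poses at 30 fps
head_motion = decoder(torch.randn(1, 300, 80), pattern)

In this sketch, the one-to-many ambiguity is resolved by the pattern vector: the same speech input paired with different behavior patterns yields different, personalized head-motion sequences.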
