Talking head, driving a source image to generate a talking video using other modality information, has made great progress in recent years. However, there are two main issues: 1) These methods are designed to utilize a single modality of information. 2) Most methods cannot control head pose. To address these problems, we propose a novel framework that can utilize multi-modal information to generate a talking head video, while achieving arbitrary head pose control by a movement sequence. Specifically, first, to extend driving information to multiple modalities, multi-modal information is encoded to a unified semantic latent space to generate expression parameters. Secondly, to disentangle attributes, the 3D Morphable Model (3DMM) is utilized to obtain identity information from the source image, and translation and rotation information from the target image. Thirdly, to control head pose and mouth shape, the source image is warped by a motion field generated by the expression parameter, translation parameter, and angle parameter. Finally, all the above parameters are utilized to render a landmark map, and the warped source image is combined with the landmark map to generate a delicate talking head video. Our experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance in terms of visual quality, lip-audio synchronization, and head pose control.
Read full abstract