Abstract. Talking head synthesis has emerged as a vital area of research, enabling the generation of realistic and expressive digital avatars. This paper surveys the primary mechanisms driving talking head synthesis, categorized into video-driven and audio-driven methods. Video-driven techniques manipulate facial movements using key points, 3D meshes, and latent spaces, while audio-driven approaches focus on synchronizing lip movements and facial expressions with audio inputs. Recent advances in each category are reviewed, highlighting key innovations and outstanding challenges such as occlusion handling, identity preservation, and lip synchronization. The technology's applications span smart customer service, online education, telemedicine, and video creation. Future research directions focus on overcoming challenges such as handling large-angle head poses, ensuring temporal consistency, and improving multilingual performance.