Abstract. Talking head synthesis has emerged as a vital area of research, enabling the generation of realistic and expressive digital avatars. This paper surveys the primary mechanisms driving talking head synthesis, categorized into video-driven and audio-driven methods. Video-driven techniques manipulate facial movements using key points, 3D meshes, and latent spaces, while audio-driven approaches focus on synchronizing lip movements and facial expressions with audio inputs. Recent advances in each category are reviewed, highlighting key innovations and outstanding challenges such as occlusion handling, identity preservation, and lip synchronization. The technology's applications span smart customer service, online education, telemedicine, and video creation. Future research directions focus on overcoming challenges such as handling large-angle head poses, ensuring temporal consistency, and improving multilingual performance.