Parallel and High-Fidelity Text-to-Lip Generation

Jinglin Liu, Wencan Huang, Zhiying Zhu, Zhou Zhao, Yi Ren, Nicholas Yuan, Baoxing Huai

doi:10.1609/aaai.v36i2.20066

Abstract

As a key component of talking face generation, lip movement generation determines the naturalness and coherence of the generated talking face video. Prior literature mainly focuses on speech-to-lip generation, while text-to-lip (T2L) generation has received little attention. T2L is a challenging task, and existing end-to-end works depend on the attention mechanism and autoregressive (AR) decoding. However, AR decoding generates the current lip frame conditioned on previously generated frames, which inherently limits inference speed and also degrades the quality of generated lip frames due to error propagation. This motivates research on parallel T2L generation. In this work, we propose ParaLip, a parallel decoding model for fast and high-fidelity text-to-lip generation. Specifically, we predict the duration of the encoded linguistic features and model the target lip frames conditioned on the encoded linguistic features with their durations in a non-autoregressive manner. Furthermore, we incorporate the structural similarity index loss and adversarial learning to improve the perceptual quality of generated lip frames and alleviate the blurry prediction problem. Extensive experiments conducted on the GRID and TCD-TIMIT datasets demonstrate the superiority of the proposed method.
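The duration-based conditioning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each encoded text token is simply repeated for its predicted number of lip frames, so that a decoder could then process all frames in parallel. The function name `expand_by_duration` and the toy values are hypothetical.

```python
import numpy as np


def expand_by_duration(encoded, durations):
    """Repeat each token's feature vector durations[i] times along the time axis.

    encoded:   array of shape (num_tokens, feature_dim), hypothetical encoder output
    durations: integer array of shape (num_tokens,), frames assigned to each token
    returns:   array of shape (sum(durations), feature_dim)
    """
    encoded = np.asarray(encoded)
    durations = np.asarray(durations, dtype=int)
    if encoded.shape[0] != durations.shape[0]:
        raise ValueError("need exactly one duration per encoded token")
    if (durations < 0).any():
        raise ValueError("durations must be non-negative")
    # np.repeat with a per-row count produces the frame-level sequence in one call
    return np.repeat(encoded, durations, axis=0)


# Toy example (assumed values): 3 tokens with 2-dimensional features
encoded = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
durations = np.array([2, 1, 3])
frames_in = expand_by_duration(encoded, durations)
print(frames_in.shape)  # (6, 2)
```

Because the expanded sequence already has one row per output frame, no frame depends on a previously generated one, which is the property that makes non-autoregressive decoding possible in this kind of design.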

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Jun 28, 2022
Citations: 3

Similar Papers

Fine-grained talking face generation with video reinterpretation
Xin Huang ... Mingjie Wang
The Visual Computer | VOL. 37
22 Sep 2020

Talking Face Generation by Conditional Recurrent Adversarial Network
Yang Song ... Jingwen Zhu
01 Aug 2019

Multimodal Fusion for Talking Face Generation Utilizing Speech-Related Facial Action Units
Zhilei Liu ... Chongke Bi
ACM Transactions on Multimedia Computing, Communications, and Applications | VOL. 20
23 Sep 2024

Time-Series Prediction of the Oscillatory Phase of EEG Signals Using the Least Mean Square Algorithm-Based AR Model
Aqsa Shakeel ... Toshihisa Tanaka
Applied Sciences | VOL. 10
23 May 2020
