PersonaTalk: Preserving Personalized Dynamic Speech Style in Talking Face Generation
Recent visual speaker authentication methods have claimed effectiveness against deepfake attacks. However, this success is largely attributable to the inability of existing talking face generation methods to preserve the speaker's dynamic speech style, which is the key cue these authentication methods rely on for verification. To address this, we propose PersonaTalk, a speaker-specific method that uses the speaker's video data to improve the fidelity of the speaker's dynamic speech style in generated videos. Our approach introduces a visual context block that integrates lip-motion information into the audio features. Additionally, to improve lip-reading intelligibility in dubbed videos, a cross-dubbing phase is incorporated during training. Experiments on the GRID dataset show the superiority of PersonaTalk over existing state-of-the-art methods. These findings underscore the need for stronger defenses in existing lip-based speaker authentication methods.
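The abstract describes a visual context block that injects lip-motion information into audio features. A minimal sketch of one plausible realization is given below: audio features attend to lip-motion features through cross-attention and are refined by a feed-forward layer. The module name, dimensions, and layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualContextBlock(nn.Module):
    """Hypothetical fusion block: audio features query lip-motion features."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio_feats: torch.Tensor, lip_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T_audio, dim); lip_feats: (B, T_video, dim)
        ctx, _ = self.cross_attn(query=audio_feats, key=lip_feats, value=lip_feats)
        x = self.norm1(audio_feats + ctx)    # residual fusion of visual context into audio
        return self.norm2(x + self.ffn(x))   # position-wise refinement

# Toy usage: 64 audio frames conditioned on 20 video frames of lip motion.
audio = torch.randn(1, 64, 256)
lips = torch.randn(1, 20, 256)
print(VisualContextBlock()(audio, lips).shape)  # torch.Size([1, 64, 256])
```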
- Research Article
- 95
- 10.1609/aaai.v36i3.20154
- Jun 28, 2022
- Proceedings of the AAAI Conference on Artificial Intelligence
Audio-driven one-shot talking face generation methods are usually trained on video resources of various persons. However, their generated videos often suffer from unnatural mouth shapes and asynchronous lips because these methods struggle to learn a consistent speech style from different speakers. We observe that it is much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements. Hence, we propose a novel one-shot talking face generation framework that explores consistent correlations between audio and visual motions from a specific speaker and then transfers audio-driven motion fields to a reference image. Specifically, we develop an Audio-Visual Correlation Transformer (AVCT) that infers talking motions, represented by keypoint-based dense motion fields, from an input audio clip. In particular, since audio may come from different identities at deployment time, we represent audio signals with phonemes. In this manner, our AVCT inherently generalizes to audio spoken by other identities. Moreover, as face keypoints are used to represent speakers, AVCT is agnostic to the appearance of the training speaker and thus allows us to readily manipulate face images of different identities. Since different face shapes lead to different motions, a motion field transfer module is employed to reduce the gap between the audio-driven dense motion fields of the training identity and those of the one-shot reference. Once the dense motion field of the reference image is obtained, an image renderer generates its talking face video from an audio clip. Thanks to the learned consistent speaking style, our method generates authentic mouth shapes and vivid movements. Extensive experiments demonstrate that our synthesized videos outperform the state of the art in terms of visual quality and lip sync.
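The core idea above, mapping phoneme sequences (rather than raw audio) to keypoint-based motion, can be sketched as follows: phoneme tokens are encoded with a transformer and decoded into per-frame keypoint displacements that would parameterize a dense motion field. The vocabulary size, keypoint count, and layer sizes are illustrative assumptions and do not reflect the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PhonemeToKeypointMotion(nn.Module):
    """Hypothetical phoneme-conditioned motion predictor in the spirit of AVCT."""

    def __init__(self, num_phonemes: int = 70, dim: int = 256, num_keypoints: int = 10):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Each keypoint receives a 2-D displacement (dx, dy) per frame.
        self.head = nn.Linear(dim, num_keypoints * 2)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: (B, T) frame-aligned phoneme indices
        h = self.encoder(self.embed(phoneme_ids))
        return self.head(h).view(*phoneme_ids.shape, -1, 2)  # (B, T, K, 2)

# Toy usage: 25 frames of phoneme labels mapped to keypoint displacements.
motion = PhonemeToKeypointMotion()(torch.randint(0, 70, (1, 25)))
print(motion.shape)  # torch.Size([1, 25, 10, 2])
```

Using phoneme indices as input is what makes the predictor speaker-agnostic on the audio side: any identity's speech reduced to the same phoneme vocabulary drives the same motion space.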
- Research Article
- 62
- 10.1609/aaai.v37i2.25280
- Jun 26, 2023
- Proceedings of the AAAI Conference on Artificial Intelligence
Different people speak with diverse, personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to obtain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns from a style reference video and encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. To integrate the reference speaking style into the generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to this style-aware adaptation mechanism, the reference speaking style is better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method generates talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
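The style-aware adaptation described above, a style code steering the decoder's feed-forward layers, can be illustrated with the minimal sketch below, where the style code predicts a per-channel scale and shift of the feed-forward hidden activations (a FiLM-style modulation). This is an illustrative stand-in under that assumption, not the paper's exact weight-adaptation mechanism.

```python
import torch
import torch.nn as nn

class StyleAdaptiveFFN(nn.Module):
    """Hypothetical feed-forward layer modulated by a speaking-style code."""

    def __init__(self, dim: int = 256, hidden: int = 1024, style_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        # Predict per-channel scale and shift of the hidden activations from the style code.
        self.to_film = nn.Linear(style_dim, 2 * hidden)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) speech-content features; style: (B, style_dim) code from the style encoder
        scale, shift = self.to_film(style).chunk(2, dim=-1)
        h = torch.relu(self.fc1(x))
        h = h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # style-dependent modulation
        return self.fc2(h)

# Toy usage: 50 frames of content features rendered under a reference style code.
content = torch.randn(2, 50, 256)
style_code = torch.randn(2, 128)
print(StyleAdaptiveFFN()(content, style_code).shape)  # torch.Size([2, 50, 256])
```

Conditioning the feed-forward path, rather than concatenating the style code with the input, keeps the content and style pathways separate, which is one way a single decoder can render the same speech content in different reference styles.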