Generating talking lips in sync with input speech has the potential to enhance speech communication and enable novel applications. This paper presents a system that generates accurate 3D talking lips and is readily applicable to unseen subjects and different languages. The developed head-mounted facial acquisition device and automated data processing pipeline produce precise landmarks while mitigating the difficulty of acquiring 3D facial data. Our system generates accurate lip movements in three stages. In the first stage, a fine-tuned Wav2Vec2.0+Transformer captures long-range audio context dependencies. In the second stage, we propose the Viseme Fixing method, which significantly improves lip accuracy at the /b/, /p/, /m/, and /f/ phonemes. In the last stage, we exploit the structural relationship between the inner and outer lips and learn to map the outer lip landmarks to the inner lip landmarks. Subjective evaluations show that the generated talking lips closely match the input audio. We demonstrate two applications that animate 2D face videos and 3D face models using our landmarks. The precise lip landmarks allow the generated animations to surpass the results of state-of-the-art methods.
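The third stage learns a mapping from outer-lip landmarks to inner-lip landmarks. As a hypothetical illustration only (not the paper's actual learned model), the sketch below fits such a mapping with an ordinary least-squares linear map on synthetic data; the landmark counts (`N_OUTER`, `N_INNER`) and frame count are assumptions for the example.

```python
import numpy as np

# Illustrative stand-in for stage three: predict inner-lip landmarks from
# outer-lip landmarks. The paper learns this mapping; here we fit a simple
# least-squares linear map on synthetic data to show the idea.

rng = np.random.default_rng(0)

N_FRAMES = 200   # number of training frames (assumed)
N_OUTER = 12     # outer-lip landmarks, (x, y) each -> 24 values (assumed)
N_INNER = 8      # inner-lip landmarks, (x, y) each -> 16 values (assumed)

# Synthetic data: inner lips are an exact linear function of outer lips.
outer = rng.normal(size=(N_FRAMES, N_OUTER * 2))
true_map = rng.normal(size=(N_OUTER * 2, N_INNER * 2))
inner = outer @ true_map

# Fit the outer -> inner mapping by least squares.
learned_map, *_ = np.linalg.lstsq(outer, inner, rcond=None)

# Apply the learned mapping to an unseen frame of outer-lip landmarks.
new_outer = rng.normal(size=(1, N_OUTER * 2))
pred_inner = new_outer @ learned_map
print(pred_inner.shape)  # (1, 16)
```

In practice a nonlinear model would be trained on captured landmark data; the linear fit here only demonstrates that the inner-lip shape can be recovered from the structurally correlated outer-lip shape.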