Abstract

In this work, we propose a method to transform a speaker’s speech into a talking video of a target character, making the mouth-shape synchronization, expression, and body posture of the synthesized speaker video more realistic. This is a challenging task because changes in mouth shape and posture are coupled with the semantic content of the audio: model training is difficult to converge, and model performance is unstable in complex scenes. Existing speech-driven speaker methods do not solve this problem well. The proposed method first generates, in real time, a sequence of key points for the speaker’s face and body posture from the audio signal, and then visualizes these key points as a series of two-dimensional skeleton images. The final realistic speaker video is then produced by a video generation network. We randomly sample audio clips, encode audio content and temporal correlations with a more effective network structure, and optimize the network outputs with a differential loss and a pose-perception loss, which yields a smoother sequence of pose key points and better performance. In addition, by inserting specified action frames into the synthesized pose-sequence window, we enrich the action poses of the synthesized speaker, making the result more realistic and natural. The final speaker video is then generated from the obtained pose key points by the video generation network. To produce realistic, high-resolution pose details, we insert a local attention mechanism into the key-point network that generates the pose sequence, assigning higher attention to the local details of the character through spatial weight masks. To verify the effectiveness of the proposed method, we used the objective NME metric and subjective user evaluation. Experimental results show that our method can vividly generate speaker videos matching the audio content, with lip-matching accuracy and expression postures better than those of previous work. Compared with existing methods, our method achieves better results on both the NME metric and subjective user evaluation.
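
The abstract describes a three-stage pipeline: audio features are mapped to a sequence of pose and face key points, the key points are rasterized into 2D skeleton images, and a video generation network turns those images into speaker frames. The following is a minimal sketch of how such stages could be wired together; all class and function names (AudioToKeypoints, draw_skeleton, SkeletonToVideo) are hypothetical placeholders and do not reproduce the authors' implementation.

```python
# Hedged sketch of the audio -> key points -> skeleton image -> video pipeline.
# Dimensions, layer choices, and names are illustrative assumptions only.
import torch
import torch.nn as nn

class AudioToKeypoints(nn.Module):
    """Stage 1 (placeholder): map audio feature frames to 2D key points."""
    def __init__(self, audio_dim=80, hidden=256, num_kp=137):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_kp * 2)

    def forward(self, mel):                     # mel: (B, T, audio_dim)
        h, _ = self.rnn(mel)
        return self.head(h).view(mel.size(0), mel.size(1), -1, 2)  # (B, T, K, 2)

def draw_skeleton(kp_frame, size=512):
    """Stage 2 (placeholder): rasterize one frame of normalized key points."""
    img = torch.zeros(3, size, size)
    xy = (kp_frame.clamp(0, 1) * (size - 1)).long()
    img[:, xy[:, 1], xy[:, 0]] = 1.0            # mark each key point location
    return img

class SkeletonToVideo(nn.Module):
    """Stage 3 (placeholder): image-to-image generator producing speaker frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in for a full generator

    def forward(self, skeleton):                # skeleton: (B, 3, H, W)
        return self.net(skeleton)
```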

Highlights

  • Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China

  • We propose a method to transform a speaker’s speech information into a target character’s talking video; the method could make the mouth shape synchronization, expression, and body posture more realistic in the synthesized speaker video. This is a challenging task because changes of mouth shape and posture are coupled with audio semantic information. The model training is difficult to converge, and the model effect is unstable in complex scenes

  • We use the Dilated Depthwise Separable Residual (DDSR) unit to encode the audio features [4, 5], use a GRU network layer [6] to learn temporal features, and constrain the network outputs with content loss functions. Through this network structure, audio content and temporal correlation information are encoded effectively at the same time, the facial key-point error index of the model output is lowered, and the mouth shapes and postures of the synthesized speaker video match the audio content better; the synthesized speaker video is also more natural and realistic (a sketch of such an encoder is shown below the highlights)
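
As a concrete illustration of the highlight above, the following is a minimal PyTorch sketch of a dilated depthwise-separable residual block followed by a GRU temporal layer. The channel sizes, dilation rates, and class names are assumptions for illustration, not the configuration reported in the paper.

```python
# Hedged sketch of a Dilated Depthwise Separable Residual (DDSR) unit plus a GRU
# temporal layer. Channel sizes and dilation rates are illustrative assumptions.
import torch
import torch.nn as nn

class DDSRUnit(nn.Module):
    """Residual block: dilated depthwise 1D conv + pointwise conv over audio frames."""
    def __init__(self, channels, kernel_size=3, dilation=2):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):                       # x: (B, C, T) audio feature frames
        out = self.act(self.norm(self.pointwise(self.depthwise(x))))
        return x + out                          # residual connection

class AudioEncoder(nn.Module):
    """Stacks DDSR units for audio content, then a GRU for temporal correlation."""
    def __init__(self, channels=128, num_blocks=4, hidden=256):
        super().__init__()
        self.blocks = nn.Sequential(*[DDSRUnit(channels, dilation=2 ** i)
                                      for i in range(num_blocks)])
        self.gru = nn.GRU(channels, hidden, batch_first=True)

    def forward(self, x):                       # x: (B, C, T)
        feats = self.blocks(x)                  # (B, C, T)
        out, _ = self.gru(feats.transpose(1, 2))  # (B, T, hidden)
        return out
```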

Summary

Introduction

We propose a method to transform a speaker’s speech information into a target character’s talking video, making the mouth-shape synchronization, expression, and body posture of the synthesized speaker video more realistic. Existing methods [1, 2] feed the speaker’s voice information into a recurrent neural network to obtain 3D face model parameters, map the fitted 3D face model to 2D key points as inputs of the video synthesis module, and output the corresponding speaker pictures through the video synthesis model. Through our network structure, audio content and temporal correlation information are encoded effectively at the same time, the facial key-point error index of the model output is lowered, and the mouth shapes and postures of the synthesized speaker video match the audio content better, so the synthesized speaker video is more natural and realistic. To enrich the speaker’s detailed texture, we introduce a local attention mechanism in the key-point network and add spatial weights to the face, fingers, and other parts of the character so that these regions receive higher attention (a sketch follows below)
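
The following is a minimal sketch of how a spatial weight mask could give the face and finger key points higher attention in a key-point regression loss. The region index ranges and weight values are assumptions for illustration; the paper's actual local attention mechanism is not reproduced here.

```python
# Hedged sketch: weighting the key-point regression loss so that face and finger
# regions receive higher attention. Index ranges and weights are illustrative
# assumptions, not the paper's configuration.
import torch

def weighted_keypoint_loss(pred, target, face_idx, finger_idx,
                           face_w=2.0, finger_w=1.5):
    """pred, target: (B, T, K, 2) key-point coordinates."""
    weights = torch.ones(pred.shape[2], device=pred.device)   # (K,) base weights
    weights[list(face_idx)] = face_w                          # emphasize face region
    weights[list(finger_idx)] = finger_w                      # emphasize finger region
    per_point = (pred - target).pow(2).sum(dim=-1)            # (B, T, K) squared error
    return (per_point * weights).mean()

# Usage sketch (hypothetical index ranges for face and finger key points):
# loss = weighted_keypoint_loss(pred, gt,
#                               face_idx=range(0, 68), finger_idx=range(95, 137))
```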

