Emotional Audio-Visual Speech Synthesis Based on PAD

Jia Jia, Yongxin Wang, Lianhong Cai, Fanbo Meng, Shen Zhang

doi:10.1109/tasl.2010.2052246

Abstract

Audio-visual speech synthesis is the core function for realizing face-to-face human-computer communication. Although considerable effort has gone into enabling people to talk with computers as they do with other people, how to integrate emotional expression into audio-visual speech synthesis remains largely an open problem. In this paper, we adopt the Pleasure-Displeasure, Arousal-Nonarousal, and Dominance-Submissiveness (PAD) 3-D emotional space, in which emotions can be described and quantified along three dimensions. Based on this definition, we propose a unified model for emotional speech conversion using a Boosting-Gaussian mixture model (Boosting-GMM), as well as a facial expression synthesis model. We further present an emotional audio-visual speech synthesis approach. Specifically, we take the text and the target PAD values as input and first employ a text-to-speech (TTS) engine to generate neutral speech. The Boosting-GMM then converts the neutral speech to emotional speech, and the facial expression is synthesized simultaneously. Finally, the acoustic features of the emotional speech are used to modulate the facial expression in the audio-visual speech. We designed three objective and five subjective experiments to evaluate the performance of each model and of the overall approach. Our experimental results on audio-visual emotional speech datasets show that the proposed approach can effectively and efficiently synthesize natural and expressive emotional audio-visual speech. Analysis of the results also reveals that a mutually reinforcing relationship exists between audio and video information.
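The data flow described in the abstract can be sketched roughly as follows. This is only an illustration of the order of the stages, not the authors' implementation: every function name, the `PAD` class, the dictionary layout, and the example PAD values are hypothetical placeholders, and the actual TTS engine, Boosting-GMM conversion, and facial expression model are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PAD:
    """A point in the 3-D PAD emotional space (hypothetical container)."""
    pleasure: float
    arousal: float
    dominance: float


def tts_neutral(text):
    # Placeholder for the TTS engine that generates neutral speech.
    return {"kind": "neutral_speech", "text": text}


def convert_emotion(neutral, pad):
    # Placeholder for the Boosting-GMM neutral-to-emotional conversion.
    return {"kind": "emotional_speech", "text": neutral["text"], "pad": pad}


def synthesize_expression(pad):
    # Placeholder for the facial expression synthesis model.
    return {"kind": "facial_expression", "pad": pad}


def modulate_expression(expression, emotional):
    # Placeholder for modulating the expression with acoustic features
    # of the emotional speech.
    return {"kind": "audio_visual_speech", "audio": emotional, "video": expression}


def synthesize_audio_visual(text, pad):
    """Chain the stages in the order given in the abstract."""
    neutral = tts_neutral(text)                 # text -> neutral speech
    emotional = convert_emotion(neutral, pad)   # neutral -> emotional speech
    expression = synthesize_expression(pad)     # target PAD -> facial expression
    return modulate_expression(expression, emotional)


# Arbitrary example input; the PAD values carry no meaning from the paper.
result = synthesize_audio_visual("Hello", PAD(0.5, 0.5, 0.0))
```

The two inputs named in the abstract, the text and the target PAD values, are the only arguments to the top-level function; the speech conversion and the expression synthesis both depend on the same target PAD point.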

Journal: IEEE Transactions on Audio, Speech, and Language Processing
Publication Date: Mar 1, 2011
Citations: 47
