MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis

Yi Lei, Xinsheng Wang, Shan Yang, Lei Xie

doi:10.1109/taslp.2022.3145293

Abstract

Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and expressive speech synthesis has therefore attracted much attention in recent years. Previous methods performed expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio; both approaches can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework that models emotion at different levels. Specifically, the proposed method is a typical attention-based sequence-to-sequence model with three proposed modules: a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, utterance-level emotion variation, and syllable-level emotion strength, respectively. In addition to modeling emotion at different levels, the proposed method also allows us to synthesize emotional speech in different ways, i.e., by transferring the emotion from reference audio, predicting the emotion from input text, or controlling the emotion strength manually. Extensive experiments conducted on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference audio-based and text-based emotional speech synthesis methods on emotion transfer speech synthesis and text-based emotion prediction speech synthesis, respectively. The experiments also show that the proposed method can control emotion expression flexibly. Detailed analysis shows the effectiveness of each module and the soundness of the overall design.

Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publication Date: Jan 1, 2022
Citations: 35

Similar Papers

Improving Fine-Grained Emotion Control and Transfer with Gated Emotion Representations in Speech Synthesis
Jianhao Ye ... Wendi He
01 Jan 2023

Semi-Supervised Learning for Robust Emotional Speech Synthesis with Limited Data
Jialin Zhang ... Gulanbaier Tuerhong
Applied Sciences | VOL. 13
06 May 2023

Fine-Grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis
Yi Lei ... Shan Yang
19 Jan 2021

Fundamental frequency adjustment and formant transition based emotional speech synthesis
Haojie Zhang ... Yong Yang
01 May 2012
