Abstract
An end-to-end conversational speech synthesis system that enables flexible control of emotional states defined over emotion dimensions is proposed. The Tacotron 2 architecture is extended to receive the emotion dimensions as input. The model is first pre-trained on a large-scale spontaneous speech corpus and then fine-tuned on a natural dialogue speech corpus with manually annotated perceived emotion in the form of pleasantness and arousal. Since the pre-training corpus has no emotion information, we examined two pre-training and fine-tuning strategies and showed that the one that applies an emotion dimension estimator before pre-training is superior. A subjective evaluation of emotion controllability showed correlations of <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$R$</tex> = 0.48 for pleasantness and <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$R$</tex> = 0.78 for arousal between the given and perceived emotional states, indicating the effectiveness of the proposed conversational speech synthesis with emotion control.
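The conditioning described above can be illustrated with a minimal sketch: a 2-D emotion vector (pleasantness, arousal) is linearly projected and concatenated to every encoder time step before decoding. This is a hypothetical toy implementation in plain Python; the function and variable names are illustrative assumptions, not the paper's actual code, and a real model would use a learned projection inside a neural framework.

```python
# Hypothetical sketch of emotion-dimension conditioning for a
# Tacotron 2-style model (names are illustrative, not from the paper).

def project_emotion(emotion, weights, bias):
    """Linear projection of the (pleasantness, arousal) pair."""
    return [sum(w * e for w, e in zip(row, emotion)) + b
            for row, b in zip(weights, bias)]

def condition_encoder_outputs(encoder_outputs, emotion, weights, bias):
    """Concatenate the projected emotion vector to each encoder frame."""
    emb = project_emotion(emotion, weights, bias)
    return [frame + emb for frame in encoder_outputs]

# Toy example: 3 encoder frames of dim 4, emotion projected to dim 2.
enc = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
W = [[1.0, 0.0], [0.0, 1.0]]   # identity projection for illustration
b = [0.0, 0.0]
cond = condition_encoder_outputs(enc, [0.5, -0.5], W, b)
# Each frame now carries the emotion dimensions: dim 4 -> dim 6.
```

In a trained system the projection weights would be learned jointly with the synthesizer, so varying pleasantness and arousal at inference time steers the prosody of the generated speech.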