Abstract

This paper proposes an emotional speech synthesis system based on a three-layered model using a dimensional approach. Most previous studies of emotional speech synthesis with the dimensional approach focused only on the relationship between acoustic features and the emotion dimensions (valence and activation). However, people do not perceive emotion directly from acoustic features. As a result, the acoustic features have been particularly difficult to predict, and the affective quality of the synthesized speech falls far short of that intended. The ultimate goal of this research is to improve the accuracy of acoustic feature estimation and of the modification rules so that the synthesized affective speech lies closer to the intended position in the dimensional emotion space. The proposed system is composed of three layers: acoustic features, semantic primitives, and emotion dimensions. A Fuzzy Inference System (FIS) is used to connect the three layers. For each semantic primitive, the related acoustic features are selected for synthesizing emotional speech. On the basis of morphing rules, the estimated acoustic features are then applied to synthesize emotional speech. Listening tests were carried out to verify whether the synthesized speech conveys the intended impression in the dimensional emotion space. The results show not only that the accuracy of the estimated acoustic features is improved but also that the modification rules work well for the synthesized speech, so the proposed method improves the quality of the synthesized speech.
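
To make the layered mapping concrete, the following is a minimal sketch of how a fuzzy inference step could chain the three layers, mapping a (valence, activation) point to a semantic-primitive level and then to an acoustic-feature modification factor. The membership functions, the two toy rules, the single primitive ("bright"), and the mean-F0 scaling range are all illustrative assumptions for exposition, not the paper's actual FIS or morphing rules.

```python
# Toy three-layer mapping: emotion dimensions -> semantic primitive ->
# acoustic feature, via a minimal Mamdani-style fuzzy inference step.
# All rules, membership functions, and ranges are assumptions.
import numpy as np

def trimf(x, a, b, c):
    """Triangular membership function with peak at b on grid x."""
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

# Universe of discourse for the output primitive level in [0, 1].
u = np.linspace(0.0, 1.0, 201)
low_mf = trimf(u, -0.5, 0.0, 0.5)    # "low" shoulder, peak at 0
high_mf = trimf(u, 0.5, 1.0, 1.5)    # "high" shoulder, peak at 1

def fis_primitive(valence, activation):
    """Toy FIS: map (valence, activation) in [-1, 1] to a
    hypothetical 'bright' primitive level in [0, 1]."""
    # Degrees to which each input is "high" (linear memberships).
    v_hi, a_hi = (valence + 1) / 2, (activation + 1) / 2
    # Rule 1: IF valence high AND activation high THEN primitive high.
    w_high = min(v_hi, a_hi)
    # Rule 2: IF valence low OR activation low THEN primitive low.
    w_low = max(1 - v_hi, 1 - a_hi)
    # Aggregate the clipped consequents and defuzzify by centroid.
    agg = np.maximum(np.minimum(w_high, high_mf),
                     np.minimum(w_low, low_mf))
    return float(np.sum(agg * u) / (np.sum(agg) + 1e-9))

def primitive_to_f0_scale(bright):
    """Second stage, collapsed here to a linear morphing rule:
    a brighter voice gets a higher mean F0 (assumed +/- 20% range)."""
    return 0.8 + 0.4 * bright

if __name__ == "__main__":
    # e.g. a point in the joyful region vs. one in the sad region
    for v, a in [(0.8, 0.7), (-0.6, -0.5)]:
        b = fis_primitive(v, a)
        print(f"valence={v:+.1f} activation={a:+.1f} "
              f"-> bright={b:.2f}, F0 scale={primitive_to_f0_scale(b):.2f}")
```

In a full system of the kind the abstract describes, each semantic primitive would have its own rule base and its own set of selected acoustic features, and the defuzzified feature values would drive the morphing of a neutral utterance.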
