Towards Realizing Mandarin-Tibetan Bi-lingual Emotional Speech Synthesis with Mandarin Emotional Training Corpus

Peiwen Wu,Zhenye Gan,Hongwu Yang

doi:10.1007/978-981-10-6388-6_11

Abstract

This paper presents a method of hidden Markov model (HMM)-based Mandarin-Tibetan bi-lingual emotional speech synthesis by speaker adaptive training with a Mandarin emotional speech corpus. A one-speaker Tibetan neutral speech corpus, a multi-speaker Mandarin neutral speech corpus and a multi-speaker Mandarin emotional speech corpus are firstly employed to train a set of mixed language average acoustic models of target emotion by using speaker adaptive training. Then a one-speaker Mandarin neutral speech corpus or a one-speaker Tibetan neutral speech corpus is adopted to obtain a set of speaker dependent acoustic models of target emotion by using the speaker adaptation transformation. The Mandarin emotional speech or the Tibetan emotional speech is finally synthesized from Mandarin speaker dependent acoustic models of target emotion or Tibetan speaker dependent acoustic models of target emotion. Subjective tests show that the average emotional mean opinion score is 4.14 for Tibetan and 4.26 for Mandarin. The average mean opinion score is 4.16 for Tibetan and 4.28 for Mandarin. The average degradation opinion score is 4.28 for Tibetan and 4.24 for Mandarin. Therefore, the proposed method can synthesize both Tibetan speech and Mandarin speech with high naturalness and emotional expression by using only Mandarin emotional training speech corpus.

Full Text