Abstract

In this article, we propose a speaker-dependent model interpolation method for statistical emotional speech synthesis. The basic idea is to combine the neutral model set of the target speaker with an emotional model set selected from a pool of speakers. For model selection and interpolation-weight determination, we propose a novel monophone-based Mahalanobis distance, a proper distance measure between two hidden Markov model (HMM) sets. We design a Latin-square evaluation to reduce systematic bias in the subjective listening tests. The proposed interpolation method achieves sound performance in emotional expressiveness, naturalness, and target-speaker similarity. Moreover, this performance is achieved without collecting emotional speech from the target speaker, saving the cost of data collection and labeling.
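
The precise definition of the monophone-based Mahalanobis distance is given in the full text; as a rough sketch of the idea only, assuming each monophone HMM state carries a single diagonal-covariance Gaussian and symmetrizing by averaging the two sets' variances (both assumptions, along with all names and the data layout, are illustrative rather than the authors'):

    import numpy as np

    def state_mahalanobis(mu_a, var_a, mu_b, var_b):
        # Mahalanobis distance between two diagonal-covariance Gaussians,
        # using the average of the two variances (an assumed symmetrization).
        var = 0.5 * (np.asarray(var_a) + np.asarray(var_b))
        diff = np.asarray(mu_a) - np.asarray(mu_b)
        return float(np.sqrt(np.sum(diff * diff / var)))

    def monophone_distance(set_a, set_b):
        # Average the state-level distances over the monophones shared by
        # the two model sets; each set maps a monophone name to a list of
        # (mean, variance) pairs, one pair per emitting HMM state.
        dists = [
            state_mahalanobis(mu_a, var_a, mu_b, var_b)
            for phone in set_a.keys() & set_b.keys()
            for (mu_a, var_a), (mu_b, var_b) in zip(set_a[phone], set_b[phone])
        ]
        return sum(dists) / len(dists)

A smaller distance would then mark the emotional model set whose acoustics lie closest to the target speaker's neutral models.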

Highlights

  • Statistical speech synthesis (SSS) is a fast-growing research area for text-to-speech (TTS) systems

  • Evaluation on emotional expressiveness: we evaluate whether the synthesized speech conveys the target emotion by subjective listening tests

  • A listener listens to five waveforms and chooses the one with the nominated emotion (a 1-out-of-5 emotion-identification choice); the Latin-square balancing behind this test is sketched after this list
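
How the rows and columns of the paper's Latin square are assigned (listeners, stimuli, sessions) is detailed in the full text; the sketch below only shows the generic balancing device, a cyclic Latin square, that such a design builds on (the grouping shown is hypothetical):

    def latin_square(n):
        # Cyclic Latin square of order n: row i is (i, i+1, ..., i+n-1) mod n,
        # so every index appears exactly once in each row and each column.
        return [[(i + j) % n for j in range(n)] for i in range(n)]

    # Hypothetical use: rotate the order of the five waveforms across five
    # listener groups so no waveform is systematically favored by position.
    for group, order in enumerate(latin_square(5)):
        print(f"group {group}: presentation order {order}")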

Introduction

Statistical speech synthesis (SSS) is a fast-growing research area for text-to-speech (TTS) systems. While a state-of-the-art concatenative method [1,2] for TTS is capable of synthesizing natural and smooth speech for a specific voice, an SSS-based approach [3,4] has the strength to produce a diverse spectrum of voices without requiring a significant amount of new data. The proposed method for emotional expressiveness based on speaker-dependent model interpolation is described in Section “Interpolation methods”. An HTS system for continuous Mandarin speech synthesis is constructed for this research. In this system, the basic HMM units are the tonal phone models (TPMs) [27]. The question set for training the decision trees for state tying consists of questions on the tonal context, the phonetic context, the syllabic context, the word context, the phrasal context, and the utterance context [28].
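
Section “Interpolation methods” of the full paper defines the interpolation precisely; as a minimal sketch of the underlying operation, assuming per-state linear interpolation of Gaussian parameters (the function name, the data layout, and the choice to interpolate variances linearly are all assumptions, not the authors' scheme):

    import numpy as np

    def interpolate_state(neutral, emotional, alpha):
        # Blend one HMM state between the target speaker's neutral model
        # and the selected speaker's emotional model; alpha in [0, 1]
        # sets the emotional strength.
        (mu_n, var_n), (mu_e, var_e) = neutral, emotional
        mu = (1.0 - alpha) * np.asarray(mu_n) + alpha * np.asarray(mu_e)
        var = (1.0 - alpha) * np.asarray(var_n) + alpha * np.asarray(var_e)
        return mu, var

With alpha = 0 the synthesis falls back to the target speaker's neutral voice; raising alpha moves it toward the selected emotional model set.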
