Dialogue act based expressive speech synthesis in limited domain for the Czech language

Jindřich Matoušek, Zdeněk Hanzlíček, Martin Grůber, Daniel Tihelka

doi:10.31449/inf.v44i2.2559

Abstract

This paper deals with expressive speech synthesis in a dialogue. Dialogue acts, which are discrete expressive categories, are used to describe expressivity. The aim of the work is to create a procedure for developing expressive speech synthesis for a dialogue system in a limited domain. The domain is limited here to dialogues between a human and a computer on a given topic: reminiscing about personal photographs. To incorporate expressivity into synthetic speech, the algorithms currently used for neutral speech synthesis are modified. An expressive speech corpus is recorded and annotated using a predefined set of dialogue acts, and its acoustic analysis is performed. Unit selection and HMM-based methods are used to synthesize expressive speech, and an evaluation using listening tests is presented. The listeners assess two basic aspects of synthetic expressive speech for isolated utterances: speech quality and expressivity perception. The evaluation is also performed for utterances in a dialogue to assess the appropriateness of synthetic expressive speech. It can be concluded that synthetic expressive speech is rated positively even though its quality is worse than that of neutral speech synthesis. However, synthetic expressive speech is able to transmit expressivity to listeners and to improve the naturalness of the synthetic speech.

Highlights

Nowadays, speech synthesis techniques produce high-quality, intelligible speech
The results suggest that the quality of expressive synthetic speech is worse than the quality of neutral synthetic speech by 0.49 MOS points (13 %) on average
Even though this work deals mostly with unit selection speech synthesis, the results of an experiment with HMM-based expressive speech synthesis are also briefly discussed
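
As a reading aid for the MOS figure in the highlights, the following is a minimal sketch of how a mean opinion score (MOS) gap between two systems could be computed from listening-test ratings. The function names, the 1 to 5 rating scale, and the ratings themselves are illustrative assumptions and do not come from the paper; in particular, the sketch assumes the percentage is taken relative to the neutral MOS, which the source does not state.

```python
from statistics import mean


def mos(ratings):
    """Mean opinion score: arithmetic mean of ratings (assumed 1-5 scale)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings are assumed to lie on a 1-5 scale")
    return mean(ratings)


def mos_gap(neutral_ratings, expressive_ratings):
    """Return (absolute MOS drop, drop as a percentage of the neutral MOS).

    Assumption: the relative figure is normalised by the neutral MOS.
    """
    neutral = mos(neutral_ratings)
    expressive = mos(expressive_ratings)
    drop = neutral - expressive
    return drop, 100.0 * drop / neutral


# Hypothetical ratings, made up for illustration only (not the paper's data).
neutral_ratings = [4, 4, 5, 3, 4]
expressive_ratings = [3, 4, 4, 3, 3]

drop, relative_drop = mos_gap(neutral_ratings, expressive_ratings)
print(f"MOS drop: {drop:.2f} points ({relative_drop:.1f} % of neutral MOS)")
```

With real listening-test data, the same calculation would be run over all rated utterances for each system; the made-up ratings above are not intended to reproduce the reported 0.49 points or 13 %.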

Summary

Introduction

Speech synthesis techniques produce high-quality, intelligible speech. To use synthetic speech in dialogue systems (ticket booking [1], information on restaurants or hotels [2], flights [3], trains [4] or weather [5]) or in other human-computer interactive systems (virtual computer companions, computer games), the voice interface should be friendlier, so that the user feels more involved in the interaction or communication. There are various methods for producing synthetic speech; the most widely used are unit selection [27], HMM-based methods [28], DNN-based methods [29], and other methods based on neural networks [30, 31]. These methods can certainly be used for expressive speech synthesis as well. Even though this work focuses mainly on the unit selection method for expressive speech synthesis, a brief description of preliminary experiments with the HMM-based method is also presented. As the results of this work are to be used in a dialogue system, the suitability of the produced expressive synthetic speech is evaluated directly in dialogues.

Natural dialogues

Recording setup

Recording application description

Audiovisual database statistics

Texts preparation

Recording process

Expressivity description

Expressive corpus annotation

Listening test background

Objective annotation

General unit selection approach

Concatenation cost

Target cost

Advanced target cost for expressive speech synthesis

General penalty matrix

Basic target cost for expressive speech synthesis

Listening test based differences

Acoustic analysis based differences

Final penalty matrix

Weight tuning for dialogue act feature

Evaluation & results

Evaluation of the unit selection based expressive speech synthesis

Expressivity perception in synthetic speech

Quality evaluation

Evaluation of the HMM-based expressive speech synthesis

Evaluation of the expressivity in dialogues

Conclusion

Journal: Informatica
Publication Date: Jun 15, 2020
License type: cc-by