Voice Quality Modelling for Expressive Speech Synthesis

Carlos Monzo, Joan Claudi Socoró, Ignasi Iriondo

doi:10.1155/2014/627189


Open Access



Abstract

This paper presents the perceptual experiments carried out to validate a methodology for transforming speech from a neutral style into a number of expressive styles by modelling voice quality (VoQ) parameters together with the well-known prosodic parameters (F0, duration, and energy). The main goal was to validate the usefulness of VoQ in enhancing expressive synthetic speech in terms of speech quality and style identification. A harmonic plus noise model (HNM) was used to modify VoQ and prosodic parameters that were extracted from an expressive speech corpus. The perception test results indicated that modelling VoQ along with prosodic characteristics improved the expressive speech styles obtained.

Highlights

The research fields of automatic speech recognition (ASR) and text-to-speech (TTS) synthesis benefit from expressive speech, that is, speech with emotional content, which is more spontaneous, to make human-machine interactions more natural, for example, in terms of emotion recognition [1, 2] and voice transformation [3, 4, 5].
The amount of noise present in the speech is controlled by means of the harmonic-to-noise ratio (HNR) parameter, which is useful for SEN style identification.
The main aim of this work was to validate the usefulness of voice quality (VoQ) in enhancing expressive synthetic speech for style identification while maintaining acceptable quality.

Summary

Introduction

The research fields of automatic speech recognition (ASR) and text-to-speech (TTS) synthesis benefit from expressive speech, that is, speech with emotional content, which is more spontaneous, to make human-machine interactions more natural, for example, in terms of emotion recognition [1, 2] and voice transformation [3, 4, 5]. Voice quality (VoQ) and prosody parameters (F0, duration, and energy) can be conveniently manipulated to represent or convey the emotional content of speech in ASR or TTS applications, respectively [1, 3, 6, 7, 8, 9, 10]. Although VoQ has been less explored than prosody, recent works propose using both types of data to improve the acoustic modelling of expressive speech [7, 8, 9, 10]. Parameterising speech into harmonic and stochastic components allows for flexible manipulation of VoQ over the time and pitch scales, making it possible to maintain a high degree of natural speech quality.
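
As an illustration of why a harmonic plus stochastic decomposition makes noise-related parameters easy to manipulate, the following minimal Python sketch computes a harmonic-to-noise ratio from two already separated components and rescales the noise part to reach a target value. It is not the HNM implementation used in the paper: the energy-ratio definition of HNR in dB, the function names `hnr_db` and `set_hnr`, and the synthetic test signal are all assumptions made for this example.

```python
import numpy as np


def hnr_db(harmonic, noise):
    """HNR in dB, assumed here to be the energy ratio of the two components."""
    e_h = np.sum(np.square(harmonic))
    e_n = np.sum(np.square(noise))
    return 10.0 * np.log10(e_h / e_n)


def set_hnr(harmonic, noise, target_db):
    """Rescale the noise component so the pair has the requested HNR.

    Returns the recombined signal and the rescaled noise component.
    """
    # Scaling the noise amplitude by g changes the HNR by -20*log10(g).
    gain = 10.0 ** ((hnr_db(harmonic, noise) - target_db) / 20.0)
    scaled_noise = gain * noise
    return harmonic + scaled_noise, scaled_noise


# Hypothetical test signal: five harmonics of a 120 Hz tone plus white noise.
fs = 16000                      # sampling rate in Hz (assumed)
f0 = 120.0                      # fundamental frequency in Hz (assumed)
t = np.arange(1600) / fs        # 100 ms of samples
harmonic = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 6))
rng = np.random.default_rng(0)
noise = 0.05 * rng.standard_normal(t.shape)

# Lower the HNR to 10 dB, that is, make the signal noisier.
noisy_signal, scaled_noise = set_hnr(harmonic, noise, target_db=10.0)
```

Because the two components are kept separate, the noise level can be changed without touching the harmonic part, which is the property the decomposition is meant to provide.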
