Abstract

The source-filter model is one of the main techniques applied to speech analysis and synthesis. Recent advances in voice production by means of three-dimensional (3D) source-filter models have overcome several limitations of classic one-dimensional techniques. Despite preliminary attempts to improve the expressiveness of 3D-generated voices, they are still far from achieving realistic results. Towards this goal, this work analyses the contribution of both vocal tract (VT) and glottal source spectral (GSS) cues to the generation of happy and aggressive speech through a GlottDNN-based analysis-by-synthesis methodology. Paired neutral and expressive utterances are parameterised to generate different combinations of expressive vowels, applying the target expressive GSS and/or VT cues to the neutral vowels after transplanting the expressive prosody onto these utterances. The objective tests, focused on the Spanish [a], [i] and [u] vowels, show that both GSS and VT cues significantly reduce the spectral distance to the expressive target. The results of the perceptual test show that VT cues make a statistically significant contribution to the expression of happy and aggressive emotions in [a] vowels, while the GSS contribution is significant in [i] and [u] vowels.
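As a rough illustration of the analysis-by-synthesis methodology described above, the following sketch recombines glottal source and vocal tract spectral envelopes and measures how far each hybrid lies from the expressive target. Everything in it (function names, envelope shapes, random toy data) is an assumption for illustration, not the paper's actual GlottDNN pipeline.

```python
# Minimal sketch of the cue-swapping idea, assuming log-magnitude spectral
# envelopes: under the source-filter model, the speech spectrum is the
# product of source and filter, so their log envelopes add. The toy data,
# names, and shapes below are illustrative, not the paper's GlottDNN setup.
import numpy as np

def combine_cues(gss_env, vt_env):
    """Compose a speech envelope (dB) from glottal source and vocal tract cues."""
    return gss_env + vt_env

def log_spectral_distance(env_a, env_b):
    """RMS distance (dB) between two log-magnitude spectral envelopes."""
    return np.sqrt(np.mean((env_a - env_b) ** 2))

# One toy analysis frame per condition, over 257 frequency bins.
rng = np.random.default_rng(0)
neutral_gss, neutral_vt = rng.normal(size=257), rng.normal(size=257)
expressive_gss, expressive_vt = rng.normal(size=257), rng.normal(size=257)

target = combine_cues(expressive_gss, expressive_vt)
hybrids = {
    "neutral":  combine_cues(neutral_gss, neutral_vt),
    "GSS only": combine_cues(expressive_gss, neutral_vt),
    "VT only":  combine_cues(neutral_gss, expressive_vt),
}
for name, env in hybrids.items():
    print(f"{name:9s} -> {log_spectral_distance(env, target):.2f} dB to target")
```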

Highlights

  • Speech is an incredibly powerful means of communication, as it codifies linguistic information, i.e., a message, and provides paralinguistic cues about the emotional state of the speaker [1]

  • The contributions of glottal source spectral (GSS) and vocal tract (VT) cues to the generation of happy and aggressive expressive styles were evaluated through both objective measures and subjective tests

  • The contribution of the glottal source spectral envelope (GSSE) is only significant for the stressed vowels, with a 19.2% increase

Introduction

Speech is an incredibly powerful means of communication, as it codifies linguistic information, i.e., a message, and provides paralinguistic cues about the emotional state of the speaker [1]. These cues are exploited in Human-Computer Interaction (HCI), both in the input channel, by means of emotional speech recognition [2], and in the output channel, through expressive speech synthesis [3], among others. In this context, emotions have traditionally been represented: (i) in a dimensional space, such as the circumplex model, which is defined by arousal, valence, and dominance [4]; or (ii) as discrete categories, such as those defined in [5] and denoted the big six basic emotions, namely anger, disgust, fear, happiness, sadness, and surprise. In order to model the specific characteristics of spoken emotions, acoustic features can be extracted from emotional speech databases and analysed to describe each emotion with respect to neutral speech [6], which is typically taken as the reference for inexpressiveness. In this regard, several studies focusing on basic emotions have found specific prosodic patterns of F0, energy and speech rate. For instance, high-arousal emotions such as anger and happiness present higher F0, energy and speech rate values compared with neutral speech, while the opposite occurs with sadness [7].
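As a minimal, hedged illustration of extracting such prosodic cues, the snippet below estimates frame-level F0 with the pYIN tracker and energy as RMS amplitude using librosa; the file name and sampling rate are placeholder assumptions, and speech rate would additionally require a phone- or syllable-level segmentation not shown here.

```python
# Illustrative only: measuring two of the prosodic cues discussed above
# (F0 and energy) from a waveform. "utterance.wav" is a placeholder path.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)

# Frame-level F0 with the pYIN tracker; unvoiced frames are returned as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Frame-level energy as root-mean-square amplitude.
rms = librosa.feature.rms(y=y)[0]

print(f"mean F0: {np.nanmean(f0):.1f} Hz over voiced frames")
print(f"mean RMS energy: {rms.mean():.4f}")
```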
