An Iterated Two-Step Sinusoidal Pitch Contour Formulation for Expressive Speech Synthesis

Noraini Seman, Nursuriati Jamil, Izzad Ramli

doi:10.32890/jict2021.20.4.2

Open Access

Abstract

Intonation generation is essential to producing a high-quality expressive speech synthesizer for the Malay language, particularly for expressive speech such as storytelling. Among intonation generation approaches, explicit control has shown good intelligibility with reasonably natural speech and was therefore selected for this research. This approach modifies prosodic features such as pitch contour, intensity, and duration to generate the intonation. However, pitch contour modification remains a problem because the desired pitch contour is not achieved. This paper formulates an improved pitch contour algorithm that produces a modified pitch contour resembling the natural pitch contour. The syllable pitch contours of nine storytellers were extracted from their storytelling speech to create an expressive speech syllable dataset called STORY_DATA. All pitch contour shapes in STORY_DATA were analyzed and clustered into the standard six main pitch contour clusters for storytelling. The clustering was performed using one minus the Pearson product-moment correlation. An improved iterative two-step sinusoidal pitch contour formulation was then introduced to modify the pitch contours of neutral speech into the expressive pitch contours of natural speech. Overall, the improved formulation achieved 93 percent highly correlated matches, indicating close resemblance to the natural contours, compared with 15 percent for the previous pitch contour formulation. The improved formula can therefore be used in a text-to-speech (TTS) synthesizer to produce more natural expressive speech. The paper also found unique expressive pitch contours in the Malay language that need further investigation.
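
The distance measure named in the abstract, one minus the Pearson product-moment correlation, can be illustrated with a minimal sketch. The function names, the linear resampling step, and the sample F0 values below are assumptions made for illustration; they are not taken from the paper, and the authors' actual preprocessing and clustering procedure may differ.

```python
import math

def resample(contour, n=20):
    """Linearly resample a pitch contour (F0 values in Hz) to n points.

    Assumption: syllable contours differ in length, so they are brought
    to a common length before being compared.
    """
    if len(contour) < 2 or n < 2:
        raise ValueError("need at least two points")
    out = []
    for i in range(n):
        pos = i * (len(contour) - 1) / (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

def pearson_distance(a, b):
    """Return 1 - r, where r is the Pearson correlation of a and b.

    0 means identical shape, 1 means uncorrelated, 2 means mirrored shape.
    """
    if len(a) != len(b):
        raise ValueError("contours must have the same length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a flat contour")
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical contours (Hz), invented for this example.
rising = [180.0, 190.0, 205.0, 225.0]
rising_higher = [220.0, 230.0, 245.0, 265.0]  # same shape, shifted up 40 Hz
falling = [220.0, 210.0, 195.0, 175.0]        # mirror image of rising

same_shape = pearson_distance(rising, rising_higher)
opposite_shape = pearson_distance(rising, falling)
```

Because the measure depends only on shape, a contour shifted to a higher register stays at distance zero from the original, which is why it suits grouping contours by shape rather than by absolute pitch.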

Highlights

Expressive speech synthesis has gained interest in the last decade
The newly proposed iterated two-step pitch contour formulation and Equation 2 were tested on how well they convert a neutral pitch contour into a natural storytelling pitch contour
The neutral pitch contours from the test dataset were modified using the proposed iterated two-step pitch contour formulation, yielding the pitch contours altered by the proposed Equation 5

Summary

Introduction

Expressive text-to-speech can be used in a wide variety of applications, such as diagnosis and therapy for communication disorders like dyslexia or autism (Plaisant et al., 2000). It is crucial for the growth of digital speech (Lunce, 2007) and humanoid robots (Gelin et al., 2010). Expressive speech synthesis approaches have been introduced to generate natural, human-like synthesized speech with emotion and speaking style. Playback approaches concatenate large expressive speech units, such as syllables or phonemes, using a unit selection method. This study focuses on using the explicit control approach to produce expressive speech.
