Learning Prosodic Stress from Data in Neural Network based Text-to-Speech Synthesis

Milan Sečujski, Darko Pekar, Siniša Suzić, Stevan Ostrogonac

doi:10.15622/sp.59.8

Abstract

Naturalness is one of the most important aspects of synthesized speech, and state-of-the-art parametric speech synthesizers require training on large quantities of annotated speech data to be able to convey prosodic elements such as pitch accent and phrase boundary tone. The most frequently used framework for prosodic annotation of speech in American English is Tones and Break Indices – ToBI, which has also been adapted for use in a number of other languages. This paper presents certain deficiencies of ToBI when applied in synthesis of speech in American English, which are related to the absence of tags specifically intended to mark differences in the level of prosodic stress (emphasis) related to a particular sentence constituent. The research presented in the paper proposes the introduction of a set of tags intended for explicit modeling of the degree of prosodic stress. Namely, a certain sentence constituent can be particularly emphasized, when it is the intended focus of the utterance, or it can be de-emphasized, as is commonly the case with phrases reporting direct speech or with comment clauses. Through several listening tests it has been shown that learning such prosodic events from data has distinct advantages over approaches attempting to exploit the existing ToBI tags to convey the degree of emphasis in synthesized speech. Namely, speech synthesized by a neural network trained on data tagged for the level of prosodic stress appears more natural, and the listeners are more successful in locating the sentence constituent carrying prosodic stress.

Highlights

The quality of text-to-speech (TTS) synthesis systems is generally rated in terms of the intelligibility and the naturalness of the speech they produce.
The issue of prosodic stress, as well as the reproduction of utterances containing direct speech and reporting phrases, is given particular attention. To overcome these shortcomings of Tones and Break Indices (ToBI), the study described in this paper proposes an extension to the standard set of ToBI tags, which consists of explicit marking of the degree of emphasis that the speaker associates with particular sentence constituents.
The paper presents research aimed at increasing the quality of synthesis of expressive speech through more adequate modeling of linguistically relevant prosodic features of speech, including prosodic stress and delivery of speech in a compressed f0 range.
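
As a rough illustration of what explicit marking of the degree of emphasis could look like at the input of a neural synthesizer, the sketch below attaches a hypothetical three-level emphasis tag to each word alongside a ToBI pitch accent label and one-hot encodes both. The tag names, the reduced pitch accent inventory, and the helper functions are assumptions made for this example only; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical three-level emphasis scale (an assumption, not the paper's tag set).
EMPHASIS_LEVELS = ["deemphasized", "neutral", "emphasized"]
# A small illustrative subset of ToBI pitch accent labels, plus "none".
PITCH_ACCENTS = ["none", "H*", "L*", "L+H*"]


@dataclass
class Word:
    text: str
    pitch_accent: str = "none"
    emphasis: str = "neutral"


def one_hot(value: str, vocabulary: List[str]) -> List[float]:
    """Return a one-hot vector for `value` over `vocabulary`."""
    if value not in vocabulary:
        raise ValueError(f"unknown tag: {value!r}")
    return [1.0 if v == value else 0.0 for v in vocabulary]


def word_features(word: Word) -> List[float]:
    """Concatenate the pitch accent and emphasis encodings for one word."""
    return one_hot(word.pitch_accent, PITCH_ACCENTS) + one_hot(
        word.emphasis, EMPHASIS_LEVELS
    )


# A reporting phrase ("she said") is marked as de-emphasized,
# while the focused word ("blue") is marked as emphasized.
utterance = [
    Word("the"),
    Word("blue", pitch_accent="L+H*", emphasis="emphasized"),
    Word("one", pitch_accent="H*"),
    Word("she", emphasis="deemphasized"),
    Word("said", emphasis="deemphasized"),
]
features = [word_features(w) for w in utterance]
```

In such a setup the emphasis dimensions would simply be appended to the other linguistic features fed to the network, so the degree of prosodic stress is learned from tagged data rather than inferred from the pitch accent labels alone.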

Summary

Introduction

The quality of text-to-speech (TTS) synthesis systems is generally rated in terms of the intelligibility and the naturalness of the speech they produce. The most common reasons for this are related to the lack of training data, as well as the fact that the acoustic realizations of prosodic stress may be highly variable and commonly affect the intonation contour, the duration of particular phonetic segments, and the manner of their articulation.

Results

Conclusion

Journal: SPIIRAS Proceedings	Publication Date: Aug 1, 2018
License type: cc-by
