Abstract

Disfluent speech synthesis is necessary in some applications, such as automatic film dubbing or spoken translation. This paper presents a model for the generation of synthetic disfluent speech based on inserting each element of a disfluency into a context where it can be considered fluent. The prosody obtained by applying standard techniques to these new sentences is used for the synthesis of the disfluent sentence. In addition, local modifications are applied to the segmental units adjacent to the disfluency elements. Experiments show that duration follows this behavior, which supports the feasibility of the model.

Index Terms: speech synthesis, disfluent speech, prosody, disfluencies.

1. Introduction

Speech synthesis has already reached a high standard of naturalness [1], mainly due to the use of effective techniques such as unit-selection systems or other emerging technologies [2] based on the analysis of huge speech corpora. The main application of speech synthesis has so far focused on read-style speech, since read style can be considered the most general style, one that extrapolates to any other situation. But nowadays, and even more in the future, applications of text-to-speech (TTS) systems (e.g. automatic film dubbing, robotics, dialogue systems, or multilingual broadcasting) demand a variety of styles, since users expect the interface to do more than just read information aloud.

If synthetic voices are to be integrated into future technology, they must simulate the way people talk instead of the way people read. Synthetic speech must become conversational rather than read speech. Therefore, we claim it is necessary to move from reading to talking speech synthesizers. The two styles differ significantly from each other due to the inclusion of a variety of prosodic resources affecting the rhythm of the utterances. Disfluencies are one of these resources, defined as phenomena that interrupt the flow of speech and do not add propositional content to an utterance [3]. Despite the lack of propositional content, they may give the listener cues about what is being said [4]. Disfluencies are very frequent in everyday speech [5], so it is reasonable to hypothesize that these prosodic events must be included in order to approximate talking speech synthesis.

The study of disfluencies has been approached from several disciplines, mainly phonetics [6, 5], psycholinguistics [7, 8] and speech recognition [9, 10]. Different approaches model disfluencies according to their specific interests. The use of disfluencies in TTS systems brings additional considerations, which lead us to introduce an alternative model. This model, in contrast with other approaches used in TTS such as [11] or [12], considers the potential fluent sentences associated with the disfluent sentence and the local modifications produced when the editing term is inserted. These local modifications can affect the speech prosody and the quality of the original delivery. We show the relevance of these local modifications by studying the impact of disfluencies on the duration of the syllables surrounding the editing term of disfluent sentences.

First, we introduce the disfluent speech generation model. Second, we present the experimental procedure used to apply this model, reflecting the impact on the duration of the syllables surrounding editing terms. Third, we discuss the future work to be done in this ongoing research, and the paper ends with conclusions.
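To make the generation model concrete, the Python sketch below illustrates the core idea: each disfluency element (reparandum and repair) is embedded in a fluent carrier sentence, prosody is predicted for those fluent sentences with a standard front end, and the disfluent utterance then reuses that prosody with a local modification around the editing term. This is only an illustrative sketch under assumed conventions, not the authors' implementation: the Disfluency fields, predict_prosody, and all duration values are hypothetical placeholders.

```python
# Illustrative sketch (not the authors' implementation) of the fluent-context
# generation idea: build fluent sentences containing each disfluency element,
# predict prosody on them, then splice and locally modify around the editing term.
from dataclasses import dataclass


@dataclass
class Disfluency:
    prefix: str        # words before the interruption point
    reparandum: str    # material that is abandoned
    editing_term: str  # e.g. "uh", "I mean"
    repair: str        # corrected continuation
    suffix: str        # remainder of the sentence


def predict_prosody(sentence: str) -> dict:
    """Hypothetical stand-in for any standard prosody model (durations, f0)."""
    words = sentence.split()
    return {"words": words, "durations": [0.25] * len(words)}  # dummy values


def fluent_contexts(d: Disfluency) -> tuple:
    """Embed each disfluency element in a sentence where it reads as fluent."""
    with_reparandum = " ".join(filter(None, [d.prefix, d.reparandum, d.suffix]))
    with_repair = " ".join(filter(None, [d.prefix, d.repair, d.suffix]))
    return with_reparandum, with_repair


def disfluent_prosody(d: Disfluency, lengthening: float = 1.2) -> dict:
    """Reuse prosody from the fluent contexts and apply a local modification
    (here: lengthening the word immediately before the editing term)."""
    ctx_reparandum, ctx_repair = (predict_prosody(s) for s in fluent_contexts(d))
    n_prefix = len(d.prefix.split())
    n_reparandum = len(d.reparandum.split())
    # Splice: prefix + reparandum from the first context, repair + suffix from the second.
    words = (ctx_reparandum["words"][: n_prefix + n_reparandum]
             + d.editing_term.split()
             + ctx_repair["words"][n_prefix:])
    durs = (ctx_reparandum["durations"][: n_prefix + n_reparandum]
            + [0.3] * len(d.editing_term.split())  # assumed editing-term duration
            + ctx_repair["durations"][n_prefix:])
    # Local modification to the segmental unit adjacent to the editing term.
    if n_prefix + n_reparandum > 0:
        durs[n_prefix + n_reparandum - 1] *= lengthening
    return {"words": words, "durations": durs}


if __name__ == "__main__":
    d = Disfluency(prefix="we flew to", reparandum="Boston",
                   editing_term="uh I mean", repair="Denver", suffix="on Friday")
    print(disfluent_prosody(d))  # "we flew to Boston uh I mean Denver on Friday"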
