Abstract

Disfluent speech synthesis is necessary in some applications, such as automatic film dubbing or spoken translation. This paper presents a model for the generation of synthetic disfluent speech based on inserting each element of a disfluency into a context where it can be considered fluent. The prosody obtained by applying standard techniques to these new sentences is used for the synthesis of the disfluent sentence. In addition, local modifications are applied to the segmental units adjacent to the disfluency elements. Experiments show that duration follows this behavior, which supports the feasibility of the model.

Index Terms: speech synthesis, disfluent speech, prosody, disfluencies.

1. Introduction

Speech synthesis has already reached a high standard of naturalness [1], mainly due to the use of effective techniques such as unit-selection systems or other emerging technologies [2] based on the analysis of huge speech corpora. The main application of speech synthesis has so far focused on read-style speech, since read style can be considered the most general style, one that extrapolates to any other situation. But nowadays, and even more in the future, applications of text-to-speech (TTS) systems (e.g. automatic film dubbing, robotics, dialogue systems, or multilingual broadcasting) demand a variety of styles, since users expect the interface to do more than just read information aloud.

If synthetic voices are to be integrated into future technology, they must simulate the way people talk instead of the way people read. Synthetic speech must become conversational rather than read speech. Therefore, we claim it is necessary to move from reading to talking speech synthesizers. The two styles differ significantly from each other due to the inclusion of a variety of prosodic resources affecting the rhythm of the utterances. Disfluencies are one of these resources, defined as phenomena that interrupt the flow of speech and do not add propositional content to an utterance [3]. Despite the lack of propositional content, they may give the listener cues about what is being said [4]. Disfluencies are very frequent in everyday speech [5], so it is reasonable to hypothesize that these prosodic events must be included in order to approximate talking speech synthesis.

The study of disfluencies has been approached from several disciplines, mainly phonetics [6, 5], psycholinguistics [7, 8] and speech recognition [9, 10]. Different approaches model disfluencies according to their specific interests. The use of disfluencies in TTS systems brings additional considerations, which lead us to introduce an alternative model. This model, in contrast with other approaches used in TTS such as [11] or [12], considers the potential fluent sentences associated with the disfluent sentence and the local modifications produced when the editing term is inserted. These local modifications can affect the speech prosody and the quality of the original delivery. We show the relevance of these local modifications by studying the impact of disfluencies on the duration of the syllables surrounding the editing term of disfluent sentences.

First, we introduce the disfluent speech generation model. Second, we present the experimental procedure used to apply this model, reflecting the impact on the duration of the syllables surrounding editing terms. Third, we discuss the future work to be done in this ongoing research, and the paper ends with conclusions.
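To make the generation model concrete, the Python sketch below illustrates the core idea: each disfluency element (reparandum and repair) is embedded in a fluent carrier sentence, prosody is predicted for those fluent sentences with a standard front end, and the disfluent utterance then reuses that prosody with a local modification around the editing term. This is only an illustrative sketch under assumed conventions, not the authors' implementation: the Disfluency fields, predict_prosody, and all duration values are hypothetical placeholders.

```python
# Illustrative sketch (not the authors' implementation) of the fluent-context
# generation idea: build fluent sentences containing each disfluency element,
# predict prosody on them, then splice and locally modify around the editing term.
from dataclasses import dataclass


@dataclass
class Disfluency:
    prefix: str        # words before the interruption point
    reparandum: str    # material that is abandoned
    editing_term: str  # e.g. "uh", "I mean"
    repair: str        # corrected continuation
    suffix: str        # remainder of the sentence


def predict_prosody(sentence: str) -> dict:
    """Hypothetical stand-in for any standard prosody model (durations, f0)."""
    words = sentence.split()
    return {"words": words, "durations": [0.25] * len(words)}  # dummy values


def fluent_contexts(d: Disfluency) -> tuple:
    """Embed each disfluency element in a sentence where it reads as fluent."""
    with_reparandum = " ".join(filter(None, [d.prefix, d.reparandum, d.suffix]))
    with_repair = " ".join(filter(None, [d.prefix, d.repair, d.suffix]))
    return with_reparandum, with_repair


def disfluent_prosody(d: Disfluency, lengthening: float = 1.2) -> dict:
    """Reuse prosody from the fluent contexts and apply a local modification
    (here: lengthening the word immediately before the editing term)."""
    ctx_reparandum, ctx_repair = (predict_prosody(s) for s in fluent_contexts(d))
    n_prefix = len(d.prefix.split())
    n_reparandum = len(d.reparandum.split())
    # Splice: prefix + reparandum from the first context, repair + suffix from the second.
    words = (ctx_reparandum["words"][: n_prefix + n_reparandum]
             + d.editing_term.split()
             + ctx_repair["words"][n_prefix:])
    durs = (ctx_reparandum["durations"][: n_prefix + n_reparandum]
            + [0.3] * len(d.editing_term.split())  # assumed editing-term duration
            + ctx_repair["durations"][n_prefix:])
    # Local modification to the segmental unit adjacent to the editing term.
    if n_prefix + n_reparandum > 0:
        durs[n_prefix + n_reparandum - 1] *= lengthening
    return {"words": words, "durations": durs}


if __name__ == "__main__":
    d = Disfluency(prefix="we flew to", reparandum="Boston",
                   editing_term="uh I mean", repair="Denver", suffix="on Friday")
    print(disfluent_prosody(d))  # "we flew to Boston uh I mean Denver on Friday"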
