Abstract

Research in speech synthesis has made great progress recently, perhaps motivated by its numerous applications, such as text-to-speech converters and dialog systems. The technical literature reports several improvements to existing state-of-the-art techniques, as well as new ideas for altering voice characteristics, with eventual application to different languages. Nevertheless, in spite of the attention the field has been receiving, unit selection and concatenation of waveform segments remains the most popular approach among those available today. In this paper, we report how a synthesizer for Brazilian Portuguese was constructed using a technique in which the speech waveform is generated from parameters determined directly from Hidden Markov Models. Compared with systems based on unit selection and concatenation, the proposed synthesizer has the advantage of being trainable, using contextual factors that include information at different levels of the following acoustic units: phones, syllables, words, phrases and utterances. This information is applied through a set of questions for context clustering. Thus, both the spectral and the prosodic characteristics of the system are managed by decision trees generated for each of the following parameters: mel-cepstral coefficients, fundamental frequency and state durations. As is typical of the technique based on Hidden Markov Models, synthesized speech with quality comparable to commercial applications built under the unit selection and concatenation approach can be obtained even from a database as small as eighteen minutes of speech. This was tested by a subjective comparison of samples from the synthesizer in question and other systems currently available for Brazilian Portuguese.

Highlights

  • Research in speech synthesis has made great progress recently, probably motivated by its numerous applications, among which are text-to-speech converters and dialog systems

  • It should be noted that the utterance information produced by the natural language processing (NLP) modules connected to the Hidden Markov Model (HMM)-based and MBROLA synthesizers was manually corrected in order to avoid transcription and/or stress-related errors in the synthesized speech

  • This paper described a Brazilian Portuguese speech synthesizer and its corresponding characteristics

Summary

INTRODUCTION

Research in speech synthesis has made great progress recently, probably motivated by its numerous applications, among which are text-to-speech converters and dialog systems. One of the main advantages of the HMM-based synthesis technique over the unit selection and concatenation method is that voice alteration can be performed without large databases [9,10,11]. Another advantage is that synthesized speech of usable quality can be achieved by training the system with a database as small as eighty sentences, as reported in [8]. One of the main disadvantages of the approach is the buzzy quality of the synthesized speech. This drawback is caused by the source-filter model used during the waveform generation stage, which is essentially a linear predictive vocoder, although [14] reports that the buzz can be removed by using a mixed excitation scheme.
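The source-filter idea mentioned in this introduction can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the frame length, the sample rate and the plain all-pole filter are all assumptions chosen for illustration (the abstract indicates that the actual system works with mel-cepstral coefficients). The excitation is a pulse train in voiced frames and white noise in unvoiced frames, and this hard switch between the two is the kind of simple excitation commonly associated with a buzzy sound.

```python
import random


def build_excitation(f0_per_frame, frame_len=80, sample_rate=16000, seed=0):
    """Hypothetical helper: pulse train where f0 > 0, white noise elsewhere.

    f0_per_frame holds one fundamental-frequency value (Hz) per frame;
    a value of 0 marks the frame as unvoiced.
    """
    rng = random.Random(seed)
    excitation = []
    next_pulse = 0.0  # sample position of the next pitch pulse
    for frame_idx, f0 in enumerate(f0_per_frame):
        start = frame_idx * frame_len
        for n in range(start, start + frame_len):
            if f0 > 0:
                period = sample_rate / f0  # pitch period in samples
                if n >= next_pulse:
                    # Scale the pulse so the train has roughly unit power.
                    excitation.append(period ** 0.5)
                    next_pulse += period
                else:
                    excitation.append(0.0)
            else:
                excitation.append(rng.gauss(0.0, 1.0))
                # Restart the pulse train right after the unvoiced region.
                next_pulse = n + 1
    return excitation


def all_pole_filter(excitation, lpc_coeffs, gain=1.0):
    """Assumed stand-in synthesis filter: y[n] = gain*x[n] - sum_k a[k]*y[n-k]."""
    out = []
    for n, x in enumerate(excitation):
        y = gain * x
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                y -= a * out[n - k]
        out.append(y)
    return out


# Two voiced frames at 200 Hz followed by one unvoiced frame.
source = build_excitation([200.0, 200.0, 0.0])
# Illustrative (made-up) coefficients for a stable second-order filter.
speech_like = all_pole_filter(source, [-0.9, 0.2])
```

A mixed excitation scheme, as reported in [14], would replace the hard voiced/unvoiced switch in `build_excitation` with a blend of pulse and noise components; that refinement is outside the scope of this sketch.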

ENGINE DESCRIPTION
SPEECH PARAMETER EXTRACTION
HMM TRAINING
SYNTHESIS PART
PARAMETER DETERMINATION
EXCITATION CONSTRUCTION AND FILTERING
ASPECTS OF BRAZILIAN PORTUGUESE SPEECH SYNTHESIS BASED ON HMM
THE PHONE SET
DEFINITION OF AN UTTERANCE INFORMATION
TEXT PROCESSING
THE CONTEXTUAL FACTORS
CONTEXT CLUSTERING
THE CORPUS
PARAMETER EXTRACTION
GENERATED DECISION-TREES
EXAMPLE OF SYNTHESIS
INFLUENCE OF SOME CONTEXTUAL FACTORS ON THE SYNTHESIZED SPEECH
INFLUENCE OF POS AND SYLLABLE
INFLUENCE OF SYLLABLE STRESS
THE SYNTHESIZERS
THE SENTENCES
THE SUBJECTS
THE RESULTS
CONCLUSION AND FUTURE WORK