Abstract

This paper describes an HMM-based speech synthesis system that utilizes glottal inverse filtering for generating natural sounding synthetic speech. In the proposed system, speech is first parametrized into spectral and excitation features using a glottal inverse filtering based method. The parameters are fed into an HMM system for training and then generated from the trained HMM according to text input. Glottal flow pulses extracted from real speech are used as a voice source, and the voice source is further modified according to the all-pole model parameters generated by the HMM. Preliminary experiments show that the proposed system is capable of generating natural sounding speech, and the quality is clearly better compared to a system utilizing a conventional impulse train excitation model.

Index Terms: speech synthesis, glottal inverse filtering, HMM

1. Introduction

The ultimate goal of text-to-speech synthesis (TTS) is to enable creating natural sounding speech from arbitrary text. Moreover, the current trend in TTS research calls for systems that enable producing speech in different speaking styles with different speaker characteristics and even emotions. In order to fulfill these stringent general requirements, two major synthesis techniques have attracted increasing interest in the speech research community during the past decade. These two alternatives are (1) the unit selection technique and (2) the hidden Markov model (HMM) based approach. The former has been shown to yield synthetic speech of highly natural quality. However, unit selection techniques do not allow for easy adaptation of the TTS system to different speaking styles and speaker characteristics. In addition, their implementation requires databases of extensive sizes, which severely limit the use of this TTS technique, for example, in mobile terminals. HMM-based techniques, in turn, benefit from better adaptability and a clearly smaller memory requirement.
However, the current HMM systems often suffer from degraded naturalness in quality. It can be argued that one potential reason for the reduced naturalness in current HMM-based TTS systems is the use of signal generation techniques that are too simplified to properly mimic natural speech pressure waveforms.

A large part of what can be characterized as naturalness in speech emerges from different voice characteristics as well as their context dependent changes. Therefore, it is justified in speech synthesis to search for methods aiming at accurate modeling of different voice characteristics as well as prosodic features of speech. Towards these goals, HMM-based synthesizers have been developed with special emphasis on voice characteristics such as speaker individualities, speaking styles, and emotions [1]. Moreover, some recent studies have introduced improvements to the parametric HMM systems' signal generation techniques by utilizing, for example, mixed excitation [2] and residual modeling [3]. These techniques have been shown to improve the quality of synthetic speech compared to systems utilizing a traditional impulse train excitation model. However, the quality of the systems using these techniques still remains far from the quality of natural speech.

In the real human voice production mechanism, the excitation of (voiced) speech is represented by the glottal volume velocity waveform generated by the vibrating vocal folds. This excitation signal, the glottal source, has naturally attracted interest in speech synthesis and many techniques have been proposed to mimic the glottal source of natural speech. One such technique is the Liljencrants-Fant (LF) model of the differentiated glottal source that has been used both in traditional rule-based synthesis [4, 5] as well as within an HMM-based speech synthesizer [6].
However, the use of artificial glottal flow pulses usually results in a somewhat buzzy quality due to a strong harmonic structure at higher frequencies. To overcome this problem, the idea of utilizing glottal flow pulses extracted from real speech with the help of glottal inverse filtering has been proposed [7, 8]. However, previous studies based on glottal flow pulses extracted from natural speech are limited to special purposes such as the generation of isolated vowels, and the benefits from combining automatic glottal inverse filtering with an HMM-based speech synthesizer have not been utilized.

In this paper, a novel HMM-based speech synthesis system that utilizes glottal inverse filtering for generating natural sounding synthetic speech is presented. The rest of the paper is organized as follows: Section 2 describes the proposed speech synthesis system. The results of the experiments with the new synthesizer are presented in Section 3. Discussion on the proposed speech synthesis system and future plans are presented in Section 4, and final conclusions are presented in Section 5.
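
To make the contrast between the two excitation types concrete, the following is a minimal sketch of generic source-filter synthesis, not the system proposed in this paper. It drives the same all-pole filter once with a conventional impulse train and once with a train of stored pulses. All names and numbers are assumptions made for illustration: the functions `impulse_train`, `pulse_train`, and `all_pole_filter` are hypothetical, the raised-cosine `pulse` is only a synthetic stand-in for a glottal flow pulse that would in practice be extracted from real speech by glottal inverse filtering, and the filter coefficients are hand-picked rather than generated by an HMM.

```python
import math


def impulse_train(n_samples, period):
    """Conventional excitation: a unit impulse at the start of each pitch period."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def pulse_train(n_samples, period, pulse):
    """Pulse-based excitation: overlap-add one stored pulse per pitch period."""
    out = [0.0] * n_samples
    for start in range(0, n_samples, period):
        for i, value in enumerate(pulse):
            if start + i < n_samples:
                out[start + i] += value
    return out


def all_pole_filter(excitation, a):
    """Filter with 1 / A(z), where A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p.

    Assumes a[0] == 1. Implements y[n] = x[n] - sum_k a[k] * y[n - k].
    """
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


# Assumed toy settings: 8 kHz sampling rate, 100 Hz pitch (80-sample period).
n_samples, period = 400, 80

# Hypothetical stand-in for a pulse extracted from real speech (raised cosine).
pulse = [0.5 * (1.0 - math.cos(2.0 * math.pi * i / 40)) for i in range(40)]

# Hand-picked stable all-pole filter: one resonance near 500 Hz, pole radius 0.9.
radius, angle = 0.9, math.pi / 8
a = [1.0, -2.0 * radius * math.cos(angle), radius * radius]

speech_from_impulses = all_pole_filter(impulse_train(n_samples, period), a)
speech_from_pulses = all_pole_filter(pulse_train(n_samples, period, pulse), a)
```

Both outputs share the same pitch and the same filter, so any difference between them comes from the excitation alone, which is the comparison the introduction draws between impulse train excitation and pulse-based voice sources.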
