HMM-based Finnish text-to-speech system utilizing glottal inverse filtering

Tuomo Raitio, Hannu Pulakka, Antti Suni, Paavo Alku, Martti Vainio

doi:10.21437/interspeech.2008-189

Abstract

This paper describes an HMM-based speech synthesis system that utilizes glottal inverse filtering for generating natural sounding synthetic speech. In the proposed system, speech is first parametrized into spectral and excitation features using a glottal inverse filtering based method. The parameters are fed into an HMM system for training and then generated from the trained HMM according to text input. Glottal flow pulses extracted from real speech are used as a voice source, and the voice source is further modified according to the all-pole model parameters generated by the HMM. Preliminary experiments show that the proposed system is capable of generating natural sounding speech, and the quality is clearly better compared to a system utilizing a conventional impulse train excitation model.

Index Terms: speech synthesis, glottal inverse filtering, HMM

1. Introduction

The ultimate goal of text-to-speech synthesis (TTS) is to enable creating natural sounding speech from arbitrary text. Moreover, the current trend in TTS research calls for systems that enable producing speech in different speaking styles with different speaker characteristics and even emotions. In order to fulfill these stringent general requirements, two major synthesis techniques have attracted increasing interest in the speech research community during the past decade. These two alternatives are (1) the unit selection technique and (2) the hidden Markov model (HMM) based approach. The former has been shown to yield synthetic speech of highly natural quality. However, unit selection techniques do not allow for easy adaptation of the TTS system to different speaking styles and speaker characteristics. In addition, their implementation requires databases of extensive sizes, which severely limit the use of this TTS technique, for example, in mobile terminals. HMM-based techniques, in turn, benefit from better adaptability and a clearly smaller memory requirement.
However, the current HMM systems often suffer from degraded naturalness in quality. It can be argued that a potential reason for the reduced naturalness in the current HMM-based TTS systems is the use of signal generation techniques that are too simplified to properly mimic natural speech pressure waveforms.

A large part of what can be characterized as naturalness in speech emerges from different voice characteristics as well as their context dependent changes. Therefore, it is justified in speech synthesis to search for methods aiming at accurate modeling of different voice characteristics as well as prosodic features of speech. Towards these goals, HMM-based synthesizers have been developed with special emphasis on voice characteristics such as speaker individualities, speaking styles, and emotions [1]. Moreover, some recent studies have introduced improvements to the parametric HMM systems' signal generation techniques by utilizing, for example, mixed excitation [2] and residual modeling [3]. These techniques have been shown to improve the quality of synthetic speech compared to systems utilizing a traditional impulse train excitation model. However, the quality of the systems using these techniques still remains far from the quality of natural speech.

In the real human voice production mechanism, the excitation of (voiced) speech is represented by the glottal volume velocity waveform generated by the vibrating vocal folds. This excitation signal, the glottal source, has naturally attracted interest in speech synthesis and many techniques have been proposed to mimic the glottal source of natural speech. One such technique is the Liljencrants-Fant (LF) model of the differentiated glottal source that has been used both in traditional rule-based synthesis [4, 5] and within an HMM-based speech synthesizer [6]. However, the use of artificial glottal flow pulses usually results in a somewhat buzzy quality due to a strong harmonic structure at higher frequencies.
To overcome this problem, the idea of utilizing glottal flow pulses extracted from real speech with the help of glottal inverse filtering has been proposed [7, 8]. However, previous studies based on glottal flow pulses extracted from natural speech are limited to special purposes such as the generation of isolated vowels, and the benefits from combining automatic glottal inverse filtering with an HMM-based speech synthesizer have not been utilized.

In this paper, a novel HMM-based speech synthesis system that utilizes glottal inverse filtering for generating natural sounding synthetic speech is presented. The rest of the paper is organized as follows: Section 2 describes the proposed speech synthesis system. The results of the experiments with the new synthesizer are presented in Section 3. Discussion on the proposed speech synthesis system and future plans are presented in Section 4, and final conclusions are presented in Section 5.
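
As a rough illustration of the source-filter idea referred to in the Introduction, the following Python sketch drives an all-pole filter first with a conventional impulse train and then with a train of pulses. It is not the authors' implementation. The function names, the single two-pole resonance standing in for the all-pole model, the sampling rate, and the raised-cosine toy pulse are all assumptions made for this example; the system described here uses glottal flow pulses extracted from real speech, which are not reproduced in this sketch.

```python
import math


def pulse_train(pulse, f0_hz, fs_hz, n_samples):
    """Overlap-add one copy of `pulse` at the start of every pitch period.

    With pulse == [1.0] this is the conventional impulse train excitation.
    """
    period = round(fs_hz / f0_hz)  # pitch period in samples
    out = [0.0] * n_samples
    for start in range(0, n_samples, period):
        for i, value in enumerate(pulse):
            if start + i < n_samples:
                out[start + i] += value
    return out


def resonator(freq_hz, bandwidth_hz, fs_hz):
    """Denominator coefficients [1, a1, a2] of a single two-pole resonance."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)  # pole radius, below 1
    theta = 2.0 * math.pi * freq_hz / fs_hz        # pole angle
    return [1.0, -2.0 * r * math.cos(theta), r * r]


def all_pole_filter(excitation, a):
    """Direct-form all-pole filter: y[n] = x[n] - sum_k a[k] * y[n-k], a[0] == 1."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


fs = 16000                    # assumed sampling rate in Hz
a = resonator(500, 80, fs)    # hypothetical single "formant" at 500 Hz

# Conventional excitation: one unit impulse per pitch period (f0 = 100 Hz).
impulses = pulse_train([1.0], 100, fs, 800)
from_impulses = all_pole_filter(impulses, a)

# Toy stand-in for a glottal pulse (a raised cosine, NOT an extracted pulse).
toy_pulse = [0.5 * (1.0 - math.cos(2.0 * math.pi * i / 80)) for i in range(80)]
from_pulses = all_pole_filter(pulse_train(toy_pulse, 100, fs, 800), a)
```

Only the excitation differs between the two signals; the filter is the same in both cases. A real all-pole model would have more coefficients and would change from frame to frame.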
