Abstract

The present paper describes a corpus-based singing voice synthesis system based on hidden Markov models (HMMs). This system employs HMM-based speech synthesis to synthesize singing voice. Musical information such as lyrics, tones, and durations is modeled simultaneously in a unified framework of context-dependent HMMs. The system can mimic the voice quality and singing style of the original singer. Results of a singing voice synthesis experiment show that the proposed system can synthesize smooth and natural-sounding singing voice.

Index Terms: singing voice synthesis, HMM, time-lag model

1. Introduction

In recent years, various applications of speech synthesis systems have been proposed and investigated. Singing voice synthesis is one of the hot topics in this area [1–5]. However, only a few corpus-based singing voice synthesis systems that can be constructed automatically have been proposed.

Currently, there are two main paradigms in corpus-based speech synthesis: the sample-based approach and the statistical approach. A sample-based approach such as unit selection [6] can synthesize high-quality speech, but it requires a huge amount of training data to realize various voice characteristics. On the other hand, the quality of a statistical approach such as HMM-based speech synthesis [7] is buzzy because it is based on a vocoding technique; however, it is smooth and stable, and its voice characteristics can easily be modified by transforming the HMM parameters appropriately. For singing voice synthesis, applying unit selection seems difficult because a huge amount of singing speech covering the vast combinations of contextual factors that affect singing voice would have to be recorded. In contrast, an HMM-based system can be constructed from a relatively small amount of training data. From this point of view, the HMM-based approach seems more suitable for a singing voice synthesizer, and in the present paper we apply it to singing voice synthesis.

Although the singing voice synthesis system proposed in the present paper is quite similar to the HMM-based text-to-speech synthesis system [7], there are two main differences between them. In the HMM-based text-to-speech synthesis system, contextual factors that may affect reading speech (e.g., phonemes, syllables, words, phrases, etc.) are taken into account. However, the contextual factors that may affect singing voice should be different from those for reading speech.
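The introduction notes that the contextual factors for singing voice include musical information such as lyrics, tones, and note durations, rather than only the phoneme/syllable/word contexts used for reading speech. As a rough illustration only (the label format, the ScoreNote fields, and the make_context_labels helper below are assumptions for this sketch, not the paper's actual context set), the following Python fragment shows how a score might be flattened into per-phoneme context-dependent labels that combine phoneme identity with note pitch and duration:

```python
# A minimal sketch, assuming a hypothetical label format: combine lyrics
# (phonemes), tones (note pitch), and durations into context-dependent labels,
# analogous to the phoneme/syllable/word contexts used in HMM-based TTS.
from dataclasses import dataclass
from typing import List


@dataclass
class ScoreNote:
    syllable_phonemes: List[str]  # phonemes carrying the lyric syllable
    pitch: str                    # musical tone, e.g. "A4"
    beats: float                  # note length in beats


def make_context_labels(notes: List[ScoreNote], tempo_bpm: float) -> List[str]:
    """Build one label per phoneme: identity with left/right phoneme context,
    plus the note's pitch and its duration in milliseconds (assumed feature set)."""
    # Flatten the score into (phoneme, pitch, duration_ms) triples.
    flat = []
    for note in notes:
        dur_ms = note.beats * 60000.0 / tempo_bpm
        for ph in note.syllable_phonemes:
            flat.append((ph, note.pitch, dur_ms))

    labels = []
    for i, (ph, pitch, dur_ms) in enumerate(flat):
        prev_ph = flat[i - 1][0] if i > 0 else "sil"
        next_ph = flat[i + 1][0] if i < len(flat) - 1 else "sil"
        labels.append(f"{prev_ph}-{ph}+{next_ph}/pitch:{pitch}/notedur:{dur_ms:.0f}ms")
    return labels


if __name__ == "__main__":
    # Two-note toy score at 120 bpm; each label would index a context-dependent HMM.
    score = [
        ScoreNote(["t", "w", "i"], "C5", 1.0),
        ScoreNote(["ng", "k", "l"], "G4", 0.5),
    ]
    for label in make_context_labels(score, tempo_bpm=120):
        print(label)
```

In such a scheme, each distinct label would select a context-dependent HMM (with decision-tree clustering sharing parameters across rare contexts), which is the general mechanism the paper builds on; the exact contexts used by the proposed system are defined later in the paper.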

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.