Abstract

We propose a mandarin Chinese singing voice synthesis system, in which hidden Markov model (HMM)-based speech synthesis technique is used. A mandarin Chinese singing voice corpus is recorded and musical contextual features are well designed for training. F0 and spectrum of singing voice are simultaneously modeled with context-dependent HMMs. There is a new problem, F0 of singing voice is always sparse because of large amount of context, i.e., tempo and pitch of note, key, time signature and etc. So the features hardly ever appeared in the training data cannot be well obtained. To address this problem, difference between F0 of singing voice and that of musical score (DF0) is modeled by a single Viterbi training. To overcome the over-smoothing of the generated F0 contour, syllable level F0 model based on discrete cosine transforms (DCT) is applied, F0 contour is generated by integrating two-level statistical models. The experimental results demonstrate that the proposed system outperforms the baseline system in both objective and subjective evaluations. The proposed system can generate a more natural F0 contour. Furthermore, the syllable level F0 model can make singing voice more expressive.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call