Abstract

A review of existing multilingual text-to-speech (TTS) systems shows that secondary-language words inserted into primary-language speech tend to sound like isolated words in a foreign-language environment, incongruous with the primary language's prosody. Because letter-by-letter spelling of English words and acronyms occurs frequently in Chinese speech, a duration modeling approach for English letters embedded in Chinese speech is proposed to make the English congruous with the tempo of the primary language. The model treats several major affecting factors as additive and estimates all model parameters with an expectation-maximization (EM) algorithm. In experiments, removing the modeled factor effects reduced the standard deviation of the test-set durations from 59.82 ms to 9.37 ms. The root mean squared error between the original and estimated durations was 9.35 ms in the open tests. These results confirm that the model effectively isolates several main factors that seriously affect duration. Moreover, the estimated factor values agreed well with our prior linguistic knowledge, and the hidden state labels produced by the EM algorithm were linguistically meaningful.
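
To make the idea of additive affecting factors concrete, the following is a minimal sketch and not the method of the paper. It assumes a model of the form d = mu + (effect of factor 1) + (effect of factor 2) + residual, and it fits the effects by simple backfitting (alternating averages) rather than by the EM algorithm with hidden states described above. The factor names ("letter", "position"), the function names, and all numbers are hypothetical toy values chosen only for illustration.

```python
from statistics import mean, pstdev


def fit_additive(durations, factors, n_iter=50):
    """Fit d_n ~ mu + sum_k effects[k][level of factor k in sample n].

    Uses backfitting (alternating means), NOT the EM procedure of the paper.
    """
    n_factors = len(factors[0])
    mu = mean(durations)
    effects = [{} for _ in range(n_factors)]
    for row in factors:
        for k, level in enumerate(row):
            effects[k][level] = 0.0

    for _ in range(n_iter):
        for k in range(n_factors):
            # Partial residuals: remove mu and every factor except factor k.
            sums, counts = {}, {}
            for d, row in zip(durations, factors):
                others = sum(effects[j][row[j]] for j in range(n_factors) if j != k)
                partial = d - mu - others
                sums[row[k]] = sums.get(row[k], 0.0) + partial
                counts[row[k]] = counts.get(row[k], 0) + 1
            for level in sums:
                effects[k][level] = sums[level] / counts[level]
            # Re-center the effects so that mu absorbs the overall mean.
            shift = mean(effects[k][row[k]] for row in factors)
            for level in effects[k]:
                effects[k][level] -= shift
            mu += shift
    return mu, effects


def remove_effects(durations, factors, mu, effects):
    """Return residual durations after subtracting mu and all factor effects."""
    return [
        d - mu - sum(effects[k][level] for k, level in enumerate(row))
        for d, row in zip(durations, factors)
    ]


# Hypothetical toy data in ms: (letter, position-in-spelling) per sample.
toy_factors = [
    ("A", "initial"), ("A", "final"),
    ("B", "initial"), ("B", "final"),
    ("W", "initial"), ("W", "final"),
]
toy_durations = [122.0, 153.0, 139.0, 176.0, 203.0, 232.0]

mu, effects = fit_additive(toy_durations, toy_factors)
toy_residuals = remove_effects(toy_durations, toy_factors, mu, effects)

print("std before:", pstdev(toy_durations))
print("std after: ", pstdev(toy_residuals))
```

On this toy data the spread of the residuals is much smaller than the spread of the raw durations, which mirrors, in spirit only, the reduction in standard deviation reported above once factor effects are removed.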
