Automatic synthesis unit generation for English speech synthesis based on multi-layered context oriented clustering

Shin'Ya Nakajima

doi:10.1016/0167-6393(94)90025-6

Abstract

In this paper, we propose a new synthesis unit learning method aimed at multi-lingual speech synthesis and describe its application to English speech synthesis. The method, termed Multi-Layered Context Oriented Clustering (ML-COC), generalizes the COC method, which has been applied to Japanese speech synthesis. The conventional COC method produces a set of phonetic-context-dependent units through a cluster splitting process. In ML-COC, the notion of context is generalized, and factors other than phonetic context, such as stress and syntactic boundaries, are taken into account to capture the richer phoneme variations of English. A synthesis unit generation experiment shows that ML-COC produces about three times as many synthesis units as the conventional COC (Single-Layered COC: SL-COC) method, and that the average intra-cluster variance of ML-COC units is 20% lower than that of SL-COC units. These results suggest that the ML-COC synthesis units reflect the phonological structure of English much more appropriately than the SL-COC units do. To validate the effectiveness of the ML-COC method, we conducted preference experiments using synthesized speech. In the preference test, 10 subjects listened to 52 sentences. The ML-COC method was preferred over the conventional SL-COC method by a score of 70% to 30%.
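The cluster splitting idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not state the split criterion, so the sketch assumes a greedy split that maximizes the reduction in size-weighted intra-cluster variance. The token layout (`feat`, `ctx`), the factor names (`left`, `stress`), the thresholds, and the toy data are all hypothetical. Restricting the factors to phonetic context stands in for SL-COC, and adding a stress layer stands in for ML-COC.

```python
from statistics import fmean

def variance(tokens):
    """Mean squared distance of the feature vectors from their centroid."""
    vecs = [t["feat"] for t in tokens]
    dim = len(vecs[0])
    centroid = [fmean(v[d] for v in vecs) for d in range(dim)]
    return fmean(sum((v[d] - centroid[d]) ** 2 for d in range(dim)) for v in vecs)

def best_split(tokens, factors, min_size):
    """Find the (factor, value) question with the largest variance reduction.

    Assumed criterion: reduction in size-weighted intra-cluster variance.
    """
    base = variance(tokens) * len(tokens)
    best = None
    for factor in factors:
        for value in sorted({t["ctx"][factor] for t in tokens}):
            yes = [t for t in tokens if t["ctx"][factor] == value]
            no = [t for t in tokens if t["ctx"][factor] != value]
            if len(yes) < min_size or len(no) < min_size:
                continue  # too few tokens to form a reliable unit
            gain = base - (variance(yes) * len(yes) + variance(no) * len(no))
            if best is None or gain > best[0]:
                best = (gain, yes, no)
    return best

def cluster(tokens, factors, min_size=2, min_gain=1e-9):
    """Recursively split one phoneme's tokens; each leaf is a candidate unit."""
    split = best_split(tokens, factors, min_size)
    if split is None or split[0] <= min_gain:
        return [tokens]
    _, yes, no = split
    return (cluster(yes, factors, min_size, min_gain)
            + cluster(no, factors, min_size, min_gain))

# Hypothetical tokens of one phoneme: same left phoneme, different stress.
tokens = [
    {"feat": [1.0], "ctx": {"left": "t", "stress": "stressed"}},
    {"feat": [1.2], "ctx": {"left": "t", "stress": "stressed"}},
    {"feat": [3.0], "ctx": {"left": "t", "stress": "unstressed"}},
    {"feat": [3.2], "ctx": {"left": "t", "stress": "unstressed"}},
]

single_layer = cluster(tokens, ["left"])           # phonetic context only
multi_layer = cluster(tokens, ["left", "stress"])  # phonetic context + stress

def avg_variance(clusters):
    """Average intra-cluster variance over all clusters."""
    return fmean(variance(c) for c in clusters)
```

On this toy data the phonetic-context-only run cannot separate the tokens, while the run with the stress layer splits them into two tighter clusters. This mirrors, in miniature, the abstract's observation that ML-COC yields more units with lower average intra-cluster variance.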
