The development of phone duration model in speech synthesis in the Serbian language

Sandra Sovilj-Nikic, Ivan Sovilj-Nikic

doi:10.1109/telfor.2015.7377562

Abstract

Because segmental duration matters perceptually, automatic prediction of natural phone durations is essential for natural-sounding synthetic speech. This paper presents phone duration models for the Serbian language built with several machine learning techniques: linear regression, tree-based algorithms, and meta-learning algorithms such as additive regression, bagging and stacking. Models were developed for the full phoneme set of Serbian as well as for vowels and consonants separately. Segmental duration was predicted from a feature set of 21 parameters describing phones and their contexts, extracted from a large speech database for the Serbian language. The model obtained with additive regression outperformed the other models developed for Serbian and presented in this paper. The results for the full phoneme set, and for consonants and vowels separately, are comparable with or better than those reported in the literature for Czech, Greek, English, Lithuanian, Korean, Turkish and the Indian languages Hindi and Telugu.
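As a rough illustration of the additive regression idea named in the abstract, the sketch below fits a sequence of simple regressors, each one trained on the residuals left by the previous ones. It is only a minimal sketch under stated assumptions: the abstract does not give the base learner, the settings, or the 21 features, so the decision-stump learner, the shrinkage value, the three features (vowel flag, stress flag, relative position) and the synthetic durations are all hypothetical and are not taken from the paper.

```python
import random


def fit_stump(X, residuals):
    """Fit a one-split regressor (decision stump) minimising squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    if best is None:
        # No feature has two distinct values: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    j, thr, lmean, rmean = stump
    return lmean if row[j] <= thr else rmean


def fit_additive(X, y, n_rounds=20, shrinkage=0.5):
    """Stagewise additive regression: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    residuals = [t - base for t in y]
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        residuals = [r - shrinkage * predict_stump(stump, row)
                     for row, r in zip(X, residuals)]
    return base, stumps, shrinkage


def predict(model, row):
    base, stumps, shrinkage = model
    return base + shrinkage * sum(predict_stump(s, row) for s in stumps)


def rmse(model, X, y):
    return (sum((predict(model, r) - t) ** 2 for r, t in zip(X, y)) / len(y)) ** 0.5


# Synthetic, hypothetical data: [is_vowel, is_stressed, relative_position] -> duration in ms.
rng = random.Random(0)
X = [[rng.randint(0, 1), rng.randint(0, 1), rng.random()] for _ in range(200)]
y = [60 + 30 * v + 15 * s + rng.gauss(0, 5) for v, s, _ in X]

baseline = fit_additive(X, y, n_rounds=0)   # predicts the mean duration only
model = fit_additive(X, y, n_rounds=20)
print("baseline RMSE (ms):", rmse(baseline, X, y))
print("additive RMSE (ms):", rmse(model, X, y))
```

The errors printed here are measured on the training data only, so they show that the stagewise fit reduces residual error, not how such a model would generalise to unseen speech.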
