Abstract
Given the perceptual importance of segmental duration, automatic prediction of natural phone durations is essential for achieving natural-sounding synthesized speech. This paper presents a phone duration prediction model for the Serbian language based on a tree-based machine learning approach. A large speech corpus and a feature set of 21 parameters describing phones and their contexts were used for segmental duration prediction. Phone duration modelling is based on attributes such as the current segment identity, preceding and following segment types, manner of articulation (for consonants) and voicing of neighbouring phones, lexical stress, part of speech, word length, the position of the segment in the syllable, the position of the syllable in the word, the position of the word in the phrase, and phrase break level. These features were extracted from the large speech database for the Serbian language. The results obtained for the full phoneme set using a regression tree, namely a root-mean-squared error (RMSE) of 14.8914 ms, a mean absolute error (MAE) of 11.1947 ms and a correlation coefficient of 0.8796, are comparable with those reported in the literature for Czech, Greek, Lithuanian, Korean, the Indian languages Hindi and Telugu, and Turkish. DOI: http://dx.doi.org/10.5755/j01.eee.20.3.4090
Highlights
In natural speech, the duration of speech segments depends on the context in a complex way that involves many factors [1]
These algorithms were used to build binary decision trees on a large speech corpus containing 98214 phonemes, including 38543 vowels and 59671 consonants
The results achieved using a regression tree for the full phoneme set in the Serbian language (RMSE 14.8914 ms, mean absolute error (MAE) 11.1947 ms and correlation coefficient (CC) 0.8796) are comparable with, or even outperform, the results reported in the literature for different languages
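For readers unfamiliar with the three reported measures, the following minimal Python sketch shows how RMSE, MAE and the correlation coefficient are commonly computed from actual and predicted durations. It assumes the correlation coefficient is the Pearson coefficient, which the text does not state explicitly. The function names and the four duration values are hypothetical and are not taken from the paper's corpus.

```python
import math

def rmse(actual, predicted):
    """Root-mean-squared error, in the same unit as the inputs (here ms)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error, in the same unit as the inputs (here ms)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def correlation(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Hypothetical phone durations in milliseconds (illustration only).
actual = [60.0, 80.0, 100.0, 120.0]
predicted = [70.0, 70.0, 110.0, 110.0]

print(f"RMSE = {rmse(actual, predicted):.4f} ms")
print(f"MAE  = {mae(actual, predicted):.4f} ms")
print(f"CC   = {correlation(actual, predicted):.4f}")
```

RMSE penalizes large individual errors more strongly than MAE, which is why the reported RMSE (14.8914 ms) is larger than the reported MAE (11.1947 ms).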
Summary
In natural speech, the duration of speech segments depends on the context in a complex way that involves many factors [1]. Approaches to modelling this dependence include linear statistical models, models obtained using a neural network, and models based on decision trees. The first such model for predicting the duration of speech segments in American English was developed by Riley [6] using the CART (Classification and Regression Trees) technique. One of the main advantages of the CART method is its ability to uncover structural relationships between the predicted and actual values [7]. This is the reason why the CART method is commonly used in the initial stages of phone duration modelling research.
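The idea behind a CART-style regression tree can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the authors' implementation: it uses only two hypothetical categorical features (`phone_class` and `stressed_syllable`) instead of the 21 features described in the paper, the durations are invented, and the split criterion is assumed to be the sum of squared errors, the usual choice for regression trees.

```python
def sse(values):
    """Sum of squared deviations from the mean (node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets, min_leaf):
    """Return the (feature, value) equality test that most reduces SSE, or None."""
    best, best_cost = None, sse(targets)
    for feature in rows[0]:
        for value in sorted({row[feature] for row in rows}):
            match = [t for r, t in zip(rows, targets) if r[feature] == value]
            other = [t for r, t in zip(rows, targets) if r[feature] != value]
            if len(match) < min_leaf or len(other) < min_leaf:
                continue
            cost = sse(match) + sse(other)
            if cost < best_cost - 1e-12:  # require a strict improvement
                best, best_cost = (feature, value), cost
    return best

def build_tree(rows, targets, min_leaf=2, max_depth=3, depth=0):
    """Recursively grow a binary regression tree; leaves store the mean duration."""
    split = best_split(rows, targets, min_leaf) if depth < max_depth else None
    if split is None:
        return {"leaf": sum(targets) / len(targets)}
    feature, value = split
    match = [(r, t) for r, t in zip(rows, targets) if r[feature] == value]
    other = [(r, t) for r, t in zip(rows, targets) if r[feature] != value]
    return {
        "feature": feature,
        "value": value,
        "match": build_tree([r for r, _ in match], [t for _, t in match],
                            min_leaf, max_depth, depth + 1),
        "other": build_tree([r for r, _ in other], [t for _, t in other],
                            min_leaf, max_depth, depth + 1),
    }

def predict(tree, row):
    """Follow the binary tests from the root down to a leaf."""
    while "leaf" not in tree:
        branch = "match" if row[tree["feature"]] == tree["value"] else "other"
        tree = tree[branch]
    return tree["leaf"]

# Hypothetical training data: context features and durations in ms.
rows = [
    {"phone_class": "vowel", "stressed_syllable": True},
    {"phone_class": "vowel", "stressed_syllable": True},
    {"phone_class": "vowel", "stressed_syllable": False},
    {"phone_class": "vowel", "stressed_syllable": False},
    {"phone_class": "consonant", "stressed_syllable": True},
    {"phone_class": "consonant", "stressed_syllable": True},
    {"phone_class": "consonant", "stressed_syllable": False},
    {"phone_class": "consonant", "stressed_syllable": False},
]
durations = [100.0, 110.0, 70.0, 80.0, 60.0, 50.0, 50.0, 60.0]

tree = build_tree(rows, durations)
print(predict(tree, {"phone_class": "vowel", "stressed_syllable": True}))
```

On this toy data the tree first separates consonants from vowels and then splits the vowels by stress. The resulting tree can be read directly as a set of rules, which illustrates why tree-based models are convenient for an initial look at how context affects duration.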