Abstract

The quality of text-to-speech (TTS) synthesis systems can be improved by controlling the intensities of speech segments in addition to their durations and intonation. This paper proposes linguistic and production constraints for modeling the intensity patterns of a sequence of syllables. Linguistic constraints are represented by positional, contextual and phonological features, and production constraints are represented by articulatory features associated with the syllables. A feedforward neural network (FFNN) is proposed to model the syllable intensities. The proposed FFNN model is evaluated using objective measures: the average prediction error (μ), the standard deviation (σ), the correlation coefficient (γ_{X,Y}) and the percentage of syllables predicted within different deviations. Its prediction performance is compared with that of other statistical models, namely Linear Regression (LR) and Classification and Regression Tree (CART) models. The models are also evaluated through subjective listening tests on speech synthesized by incorporating the predicted syllable intensities into a Bengali TTS system. The evaluation studies show that the FFNN model predicts syllable intensities more accurately than the LR and CART models.
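
The objective measures named above can be illustrated with a short sketch. The abstract does not give their exact formulas, so the definitions here are assumptions: the error is taken as the absolute difference between actual and predicted intensity, σ as the population standard deviation of those errors, γ_{X,Y} as the Pearson correlation, and a "deviation" as the error expressed as a percentage of the actual value. The function name `evaluate_intensity_predictions` and the default thresholds are hypothetical and do not come from the paper.

```python
import math


def evaluate_intensity_predictions(actual, predicted, thresholds=(5, 10, 25)):
    """Compute assumed versions of the objective measures for intensity prediction.

    actual, predicted: equal-length sequences of syllable intensities
    (assumed positive, e.g. in dB). thresholds: percentage deviations
    (hypothetical values, not taken from the paper).
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")

    n = len(actual)

    # Assumed error definition: absolute difference per syllable.
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mu = sum(errors) / n
    sigma = math.sqrt(sum((e - mu) ** 2 for e in errors) / n)

    # Pearson correlation between actual (X) and predicted (Y) values.
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted)) / n
    std_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    # Correlation is undefined when either sequence is constant.
    gamma = cov / (std_a * std_p) if std_a > 0 and std_p > 0 else float("nan")

    # Assumed deviation definition: error as a percentage of the actual value.
    # This requires non-zero actual intensities.
    pct_dev = [100.0 * e / abs(a) for e, a in zip(errors, actual)]
    within = {t: 100.0 * sum(d <= t for d in pct_dev) / n for t in thresholds}

    return {"mu": mu, "sigma": sigma, "gamma": gamma, "within": within}
```

Calling the function once per model (FFNN, LR, CART) on the same held-out syllables would give directly comparable values of μ, σ, γ_{X,Y} and the within-deviation percentages.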
