Abstract

Decision tree-clustered context-dependent hidden semi-Markov models (HSMMs) are typically used in statistical parametric speech synthesis to represent probability densities of acoustic features given contextual factors. This paper addresses three major limitations of this decision tree-based structure: (i) the decision tree structure lacks adequate context generalization; (ii) it is unable to express complex context dependencies; (iii) parameters generated from this structure exhibit sudden transitions between adjacent states. To alleviate these limitations, many previous studies have applied multiple decision trees with an additive assumption over those trees. The current study likewise uses multiple decision trees, but instead of the additive assumption it proposes training the smoothest distribution by maximizing an entropy measure; increasing the smoothness of the distribution improves context generalization. The proposed model, named the hidden maximum entropy model (HMEM), estimates a distribution that maximizes entropy subject to multiple moment-based constraints. Owing to the simultaneous use of multiple decision trees and the maximum entropy measure, the three aforementioned issues are considerably alleviated. Relying on HMEM, a novel speech synthesis system has been developed with maximum likelihood (ML) parameter re-estimation as well as maximum output probability parameter generation. Additionally, an effective and fast algorithm that builds multiple decision trees in parallel is devised. Two sets of experiments were conducted to evaluate the performance of the proposed system. In the first set, HMEM with heuristic context clusters was implemented; this system outperformed the decision tree structure on small training databases (50, 100, and 200 sentences). In the second set, the performance of HMEM with four parallel decision trees was investigated using both subjective and objective tests. All evaluation results of the second experiment confirm a significant improvement of the proposed system over the conventional HSMM.
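The abstract's central estimation idea, a distribution that maximizes entropy subject to moment-based constraints, can be illustrated with a toy example. The sketch below is a minimal, hypothetical illustration (not the paper's actual algorithm or feature set): it fits a discrete maximum-entropy distribution of exponential-family form by gradient ascent on the dual, where each Lagrange multiplier enforces one moment constraint. The learning rate and iteration count are assumptions of the sketch.

```python
import math

def maxent_fit(features, targets, lr=0.5, iters=2000):
    """Fit a discrete max-entropy distribution p(i) proportional to
    exp(sum_k lam[k] * f_k(i)) whose expected feature values match
    the given targets (moment constraints)."""
    K = len(targets)             # number of moment constraints
    n = len(features)            # number of discrete outcomes
    lam = [0.0] * K              # Lagrange multipliers, one per constraint
    for _ in range(iters):
        # current distribution under the exponential-family form
        logits = [sum(lam[k] * features[i][k] for k in range(K))
                  for i in range(n)]
        m = max(logits)
        w = [math.exp(l - m) for l in logits]   # stabilized exponentials
        z = sum(w)
        p = [wi / z for wi in w]
        # dual gradient: target moment minus current model moment
        for k in range(K):
            model_moment = sum(p[i] * features[i][k] for i in range(n))
            lam[k] += lr * (targets[k] - model_moment)
    return p

# three outcomes, one constraint: E[f] = 0.2 where f indicates outcome 0
feats = [(1.0,), (0.0,), (0.0,)]
p = maxent_fit(feats, [0.2])
# p[0] converges to 0.2; the unconstrained mass is split evenly
# between the other two outcomes, as maximum entropy requires
```

Because only one moment is constrained, the remaining probability mass spreads as uniformly as possible, which is exactly the smoothness property the abstract attributes to the entropy criterion.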

Highlights

  • Statistical parametric speech synthesis (SPSS) has dominated speech synthesis research area over the last decade [1,2]

  • We expect that the proposed method could be comparable with hidden semi-Markov models (HSMMs), or even outperform them, on large databases if we apply more detailed and well-designed features. From these figures and the illustrative example presented before, we can see that when the available data are limited, all features of synthetic speech generated by the hidden maximum entropy model (HMEM) are closer to the original features than those obtained with HSMM

  • This paper addressed the main shortcomings of HSMM in context-dependent acoustic modeling, namely inadequate context generalization, inability to express complex context dependencies, and abrupt parameter transitions between adjacent states


Summary

Introduction

Statistical parametric speech synthesis (SPSS) has dominated the speech synthesis research area over the last decade [1,2]. The additive structure used over multiple decision trees may not match the training data accurately, because once training is done, the first and second moments of the training data and of the model may not be exactly the same in some regions. Another important problem of conventional decision tree-clustered acoustic modeling is its difficulty in capturing the effect of weak contextual factors such as word-level emphasis [23,36]. The overall idea of this research is to improve HSMM context generalization by taking advantage of a distribution that matches the training data in many overlapping contextual regions and is optimal in the sense of an entropy criterion. This system has the potential to model the dependencies between contextual factors and acoustic features such that each training sample contributes to training multiple sets of model parameters.
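The idea that one training sample updates multiple sets of model parameters follows from using several decision trees at once: the sample lands in one leaf of every tree, so the leaves overlap rather than partition the context space. The toy sketch below (a hypothetical illustration; the real trees split on phonetic and prosodic questions, and the tree functions here are invented) encodes a context as the concatenated one-hot leaf indicators of all trees.

```python
def leaf_indicators(context, trees, leaves_per_tree):
    """Concatenated one-hot leaf indicators from several decision trees.

    Each sample activates exactly one leaf per tree, so its statistics
    contribute to several overlapping context clusters simultaneously."""
    feats = []
    for tree, n_leaves in zip(trees, leaves_per_tree):
        leaf = tree(context)  # index of the leaf this context falls into
        feats.extend(1.0 if j == leaf else 0.0 for j in range(n_leaves))
    return feats

# two toy trees over a (vowel?, stressed?) context, two leaves each
tree_a = lambda c: 0 if c["vowel"] else 1      # splits on vowel identity
tree_b = lambda c: 0 if c["stressed"] else 1   # splits on lexical stress
f = leaf_indicators({"vowel": True, "stressed": False},
                    [tree_a, tree_b], [2, 2])
# f == [1.0, 0.0, 0.0, 1.0]: one active leaf per tree
```

With a single tree the same context would activate only one cluster; with several overlapping trees, every sample informs one cluster per tree, which is the source of the improved context generalization described above.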

P(o|λ)
HMEM parameter re-estimation
Decision tree-based context clustering
Objective evaluation
Performance evaluation of HMEM with decision tree-based context clustering
Conclusions

