Abstract

This paper proposes the use of a new binary decision tree, which we call a soft decision tree, to improve generalization performance compared to the conventional ‘hard’ decision tree method that is used to cluster context-dependent model parameters in statistical parametric speech synthesis. We apply the method to improve the modeling of fundamental frequency, which is an important factor in synthesizing natural-sounding high-quality speech. Conventionally, hard decision tree-clustered hidden Markov models (HMMs) are used, in which each model parameter is assigned to a single leaf node. However, this ‘divide-and-conquer’ approach leads to data sparsity, with the consequence that it suffers from poor generalization, meaning that it is unable to accurately predict parameters for models of unseen contexts: the hard decision tree is a weak function approximator. To alleviate this, we propose the soft decision tree, which is a binary decision tree with soft decisions at the internal nodes. In this soft clustering method, internal nodes select both their children with certain membership degrees; therefore, each node can be viewed as a fuzzy set with a context-dependent membership function. The soft decision tree improves model generalization and provides a superior function approximator because it is able to assign each context to several overlapped leaves. In order to use such a soft decision tree to predict the parameters of the HMM output probability distribution, we derive the smoothest (maximum entropy) distribution which captures all partial first-order moments and a global second-order moment of the training samples. Employing such a soft decision tree architecture with maximum entropy distributions, a novel speech synthesis system is trained using maximum likelihood (ML) parameter re-estimation and synthesis is achieved via maximum output probability parameter generation. 
In addition, a soft decision tree construction algorithm optimizing a log-likelihood measure is developed. Both subjective and objective evaluations were conducted and indicate a considerable improvement over the conventional method.
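The core mechanism described above, where each internal node routes a context to both children with complementary membership degrees so that every context reaches several overlapping leaves, can be sketched in code. This is a hypothetical minimal illustration, not the paper's implementation: the sigmoid soft question, the node structure, and all parameter values are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class SoftNode:
    def __init__(self, weight=None, bias=None, left=None, right=None, value=None):
        self.weight, self.bias = weight, bias   # soft-question parameters (internal nodes)
        self.left, self.right = left, right
        self.value = value                      # leaf parameter (None for internal nodes)

def leaf_memberships(node, context, membership=1.0, out=None):
    """Accumulate {leaf: membership degree} for a context feature vector.

    At each internal node the gate g in [0, 1] sends the context to the right
    child with degree g and to the left child with degree 1 - g; a leaf's
    membership is the product of gate values along its root-to-leaf path,
    so the memberships over all leaves sum to one.
    """
    if out is None:
        out = {}
    if node.value is not None:                  # reached a leaf
        out[node] = out.get(node, 0.0) + membership
        return out
    g = sigmoid(sum(w * x for w, x in zip(node.weight, context)) + node.bias)
    leaf_memberships(node.left, context, membership * (1.0 - g), out)
    leaf_memberships(node.right, context, membership * g, out)
    return out

def predict(node, context):
    """Soft prediction: membership-weighted average of leaf parameters."""
    ms = leaf_memberships(node, context)
    return sum(m * leaf.value for leaf, m in ms.items())

# Two-leaf example with illustrative F0 means (Hz)
tree = SoftNode(weight=[1.0, -0.5], bias=0.0,
                left=SoftNode(value=100.0),
                right=SoftNode(value=220.0))
```

A hard decision tree would return exactly one leaf's value here; the soft tree blends both leaves, which is why a training sample updates several leaf parameters and unseen contexts are interpolated rather than forced into a single cluster.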

Highlights

  • Demand for natural and high-quality speech-based human-computer interaction is increasing due to applications including speech-based virtual assistants for mobile devices

  • In the hard decision tree structure, each acoustic feature vector contributes to only one contextual cluster, which is the main reason for poor generalization

  • In order to alleviate this problem, the conventional decision tree architecture is extended with the capability of exploiting soft questions


Summary

Introduction

Demand for natural and high-quality speech-based human-computer interaction is increasing due to applications including speech-based virtual assistants for mobile devices. Conventional HMM-based speech synthesis converts all non-binary contextual factors into multiple binary questions (i.e., potential decision tree splits). As mentioned earlier, this structure may suffer from inadequate context generalization. In contrast to a hard decision tree, which partitions the contextual factor space into hard contextual regions, the proposed soft decision tree is able to provide soft, i.e., overlapping, clusters. In this structure, each context is assigned to several terminal leaves with certain membership degrees; each training sample therefore affects multiple model parameters, and generalization should be improved.

2.1 F0 modeling in the HMM framework

Typically, F0 along with its delta and delta-delta derivatives forms three streams of a context-dependent [34,35] multi-space probability distribution (MSD) [36] left-to-right without-skip-transitions HSMM [58,37] (which, for obvious reasons, we shorten to 'HMM' in this paper). This model structure generates acoustic trajectories of a unit (e.g., a phoneme) by emitting observations from hidden states. A more efficient version of the forward-backward algorithm has recently been proposed by Yu et al. [65].
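The MSD mentioned above handles the mixed nature of F0: a frame is either voiced (a continuous F0 value) or unvoiced (no F0). The sketch below illustrates the basic MSD idea for one HMM state, with a Gaussian over log-F0 on the voiced subspace and a discrete mass on the zero-dimensional unvoiced subspace. This is a hypothetical minimal example; the class name, parameter values, and the single-Gaussian choice are assumptions, not the paper's configuration.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

class MSDState:
    """One HMM state's MSD output distribution for F0 (illustrative)."""

    def __init__(self, voiced_weight, mean, var):
        self.voiced_weight = voiced_weight      # P(voiced); P(unvoiced) = 1 - w
        self.mean, self.var = mean, var         # Gaussian over log-F0 when voiced

    def likelihood(self, obs):
        """obs is None for an unvoiced frame, else an observed log-F0 value."""
        if obs is None:
            return 1.0 - self.voiced_weight     # mass on the unvoiced subspace
        return self.voiced_weight * gaussian_pdf(obs, self.mean, self.var)

# Example state: mostly voiced, centred near 180 Hz in the log domain
state = MSDState(voiced_weight=0.9, mean=math.log(180.0), var=0.04)
```

In a full system each stream (F0, delta, delta-delta) has such a distribution per state, and the subspace weights are re-estimated together with the Gaussian parameters during training.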

HMM parameter re-estimation
Maximum entropy-based distributions
Conclusions
