Abstract

This paper proposes the use of a new binary decision tree, which we call a soft decision tree, to improve generalization performance compared to the conventional ‘hard’ decision tree method that is used to cluster context-dependent model parameters in statistical parametric speech synthesis. We apply the method to improve the modeling of fundamental frequency, which is an important factor in synthesizing natural-sounding high-quality speech. Conventionally, hard decision tree-clustered hidden Markov models (HMMs) are used, in which each model parameter is assigned to a single leaf node. However, this ‘divide-and-conquer’ approach leads to data sparsity, with the consequence that it suffers from poor generalization, meaning that it is unable to accurately predict parameters for models of unseen contexts: the hard decision tree is a weak function approximator. To alleviate this, we propose the soft decision tree, which is a binary decision tree with soft decisions at the internal nodes. In this soft clustering method, internal nodes select both their children with certain membership degrees; therefore, each node can be viewed as a fuzzy set with a context-dependent membership function. The soft decision tree improves model generalization and provides a superior function approximator because it is able to assign each context to several overlapped leaves. In order to use such a soft decision tree to predict the parameters of the HMM output probability distribution, we derive the smoothest (maximum entropy) distribution which captures all partial first-order moments and a global second-order moment of the training samples. Employing such a soft decision tree architecture with maximum entropy distributions, a novel speech synthesis system is trained using maximum likelihood (ML) parameter re-estimation and synthesis is achieved via maximum output probability parameter generation. 
In addition, a soft decision tree construction algorithm optimizing a log-likelihood measure is developed. Both subjective and objective evaluations were conducted and indicate a considerable improvement over the conventional method.
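The core mechanism described above, where each internal node routes a context to both children with complementary membership degrees so that every context reaches several overlapping leaves, can be sketched in code. This is a hypothetical minimal illustration, not the paper's implementation: the sigmoid soft question, the node structure, and all parameter values are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class SoftNode:
    def __init__(self, weight=None, bias=None, left=None, right=None, value=None):
        self.weight, self.bias = weight, bias   # soft-question parameters (internal nodes)
        self.left, self.right = left, right
        self.value = value                      # leaf parameter (None for internal nodes)

def leaf_memberships(node, context, membership=1.0, out=None):
    """Accumulate {leaf: membership degree} for a context feature vector.

    At each internal node the gate g in [0, 1] sends the context to the right
    child with degree g and to the left child with degree 1 - g; a leaf's
    membership is the product of gate values along its root-to-leaf path,
    so the memberships over all leaves sum to one.
    """
    if out is None:
        out = {}
    if node.value is not None:                  # reached a leaf
        out[node] = out.get(node, 0.0) + membership
        return out
    g = sigmoid(sum(w * x for w, x in zip(node.weight, context)) + node.bias)
    leaf_memberships(node.left, context, membership * (1.0 - g), out)
    leaf_memberships(node.right, context, membership * g, out)
    return out

def predict(node, context):
    """Soft prediction: membership-weighted average of leaf parameters."""
    ms = leaf_memberships(node, context)
    return sum(m * leaf.value for leaf, m in ms.items())

# Two-leaf example with illustrative F0 means (Hz)
tree = SoftNode(weight=[1.0, -0.5], bias=0.0,
                left=SoftNode(value=100.0),
                right=SoftNode(value=220.0))
```

A hard decision tree would return exactly one leaf's value here; the soft tree blends both leaves, which is why a training sample updates several leaf parameters and unseen contexts are interpolated rather than forced into a single cluster.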

Highlights

  • Demand for natural and high-quality speech-based human-computer interaction is increasing due to applications including speech-based virtual assistants for mobile devices

  • In the hard decision tree structure, each acoustic feature vector contributes to only one contextual cluster, which is the main reason for poor generalization

  • In order to alleviate this problem, the conventional decision tree architecture is extended with the capability of exploiting soft questions


Summary

Introduction

Demand for natural and high-quality speech-based human-computer interaction is increasing due to applications including speech-based virtual assistants for mobile devices. Conventional HMM-based speech synthesis converts all non-binary contextual factors into multiple binary questions (i.e., potential decision tree splits). As mentioned earlier, this structure may suffer from inadequate context generalization. In contrast to a hard decision tree, which partitions the contextual factor space into hard contextual regions, the proposed soft decision tree is able to provide soft, i.e., overlapping, clusters. In this structure, each context is assigned to several terminal leaves with certain membership degrees; each training sample therefore affects multiple model parameters, and generalization should be improved.

2.1 F0 modeling in the HMM framework

Typically, F0 along with its delta and delta-delta derivatives forms three streams of a context-dependent [34,35] multi-space probability distribution (MSD) [36] left-to-right without-skip-transitions HSMM [58,37] (which, for obvious reasons, we shorten to 'HMM' in this paper). This model structure generates acoustic trajectories of a unit (e.g., a phoneme) by emitting observations from hidden states. A more efficient version of the forward-backward algorithm has recently been proposed by Yu et al. [65].
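The MSD mentioned above handles the mixed nature of F0: a frame is either voiced (a continuous F0 value) or unvoiced (no F0). The sketch below illustrates the basic MSD idea for one HMM state, with a Gaussian over log-F0 on the voiced subspace and a discrete mass on the zero-dimensional unvoiced subspace. This is a hypothetical minimal example; the class name, parameter values, and the single-Gaussian choice are assumptions, not the paper's configuration.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

class MSDState:
    """One HMM state's MSD output distribution for F0 (illustrative)."""

    def __init__(self, voiced_weight, mean, var):
        self.voiced_weight = voiced_weight      # P(voiced); P(unvoiced) = 1 - w
        self.mean, self.var = mean, var         # Gaussian over log-F0 when voiced

    def likelihood(self, obs):
        """obs is None for an unvoiced frame, else an observed log-F0 value."""
        if obs is None:
            return 1.0 - self.voiced_weight     # mass on the unvoiced subspace
        return self.voiced_weight * gaussian_pdf(obs, self.mean, self.var)

# Example state: mostly voiced, centred near 180 Hz in the log domain
state = MSDState(voiced_weight=0.9, mean=math.log(180.0), var=0.04)
```

In a full system each stream (F0, delta, delta-delta) has such a distribution per state, and the subspace weights are re-estimated together with the Gaussian parameters during training.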

HMM parameter re-estimation
Maximum entropy-based distributions
Conclusions
