Abstract
The generated speech of hidden Markov model (HMM)-based statistical parametric speech synthesis still sounds muffled. One cause of this degradation in speech quality may be the loss of fine spectral structures. In this paper, we propose to use a deep generative architecture, a deep neural network (DNN) generatively trained, as a postfilter. The network models the conditional probability of the spectrum of natural speech given that of synthetic speech to compensate for the gap between synthetic and natural speech. The proposed probabilistic postfilter is generatively trained by cascading two restricted Boltzmann machines (RBMs) or deep belief networks (DBNs) with one bidirectional associative memory (BAM). We devised two types of DNN postfilters: one operating in the mel-cepstral domain and the other in the higher-dimensional spectral domain. We compare these two new data-driven postfilters with other types of postfilters that are currently used in speech synthesis: a fixed mel-cepstral-based postfilter, global variance-based parameter generation, and modulation spectrum-based enhancement. Subjective evaluations using the synthetic voices of a male and a female speaker confirmed that the proposed DNN-based postfilter in the spectral domain significantly improved the segmental quality of synthetic speech compared with the conventional methods.
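As a rough illustration of how such a cascade could be applied at synthesis time, the following Python sketch propagates a synthetic spectrum up a DBN stack, across a BAM, and down a second DBN stack to estimate the natural spectrum. All names, layer structures, and the mean-field forward pass are our assumptions for illustration; the paper's trained parameters and inference details are not reproduced here.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class CascadePostfilter:
        """Hypothetical sketch: DBN (synthetic side) -> BAM -> DBN (natural side).

        syn_layers / nat_layers are lists of (W, b) pairs assumed to come from
        layer-wise generative pre-training; bam_W ties the two top hidden layers.
        """

        def __init__(self, syn_layers, bam_W, nat_layers):
            self.syn_layers = syn_layers
            self.bam_W = bam_W
            self.nat_layers = nat_layers

        def enhance(self, syn_spectrum):
            # Bottom-up mean-field pass through the synthetic-speech DBN.
            h = syn_spectrum
            for W, b in self.syn_layers:
                h = sigmoid(h @ W + b)
            # Cross to the natural-speech hidden representation via the BAM.
            h = sigmoid(h @ self.bam_W)
            # Top-down pass through the natural-speech DBN; the visible layer
            # is kept linear here because spectra are real-valued.
            for i, (W, b) in enumerate(self.nat_layers):
                pre = h @ W + b
                h = pre if i == len(self.nat_layers) - 1 else sigmoid(pre)
            return h  # estimated natural-speech spectrum for this frame

A mel-cepstral-domain variant would have the same structure but operate on lower-dimensional cepstral vectors, matching the two postfilter types described in the abstract.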
Highlights
Statistical parametric speech synthesis is one of the most popular methods of speech synthesis due to its flexibility and compact footprint [2]
We found that the proposed deep neural network (DNN)-based postfilter in the spectral domain produced synthetic speech of higher quality than that obtained with any of the conventional postfilters
We proposed a data-driven postfilter technique to improve the segmental quality of statistical parametric text-to-speech synthesis
Summary
Statistical parametric speech synthesis is one of the most popular methods of speech synthesis due to its flexibility and compact footprint [2]. Statistical parametric speech synthesizers have repeatedly been found to be as intelligible as natural human speech at the annual evaluation events of corpus-based speech synthesis systems called the "Blizzard Challenge" [3]. However, it is known that synthesised speech generated from statistical models still sounds "muffled" compared to natural speech; this is often attributed to the loss of fine spectral structures in the generated parameters. Deep neural networks (DNNs) with many hidden layers have been actively investigated to improve the quality of synthetic speech, and several significant improvements have been reported. Recurrent neural networks (RNNs) with long short-term memory (LSTM) have been used for prosody modelling [10] and acoustic trajectory modelling [11], [12]