Abstract
The generated speech of hidden Markov model (HMM)-based statistical parametric speech synthesis still sounds muffled. One cause of this degradation in speech quality may be the loss of fine spectral structures. In this paper, we propose to use a deep generative architecture, a deep neural network (DNN) generatively trained, as a postfilter. The network models the conditional probability of the spectrum of natural speech given that of synthetic speech to compensate for the gap between synthetic and natural speech. The proposed probabilistic postfilter is generatively trained by cascading two restricted Boltzmann machines (RBMs) or deep belief networks (DBNs) with one bidirectional associative memory (BAM). We devised two types of DNN postfilters: one operating in the mel-cepstral domain and the other in the higher-dimensional spectral domain. We compare these two new data-driven postfilters with other types of postfilters that are currently used in speech synthesis: a fixed mel-cepstral-based postfilter, global variance-based parameter generation, and modulation spectrum-based enhancement. Subjective evaluations using the synthetic voices of a male and a female speaker confirmed that the proposed DNN-based postfilter in the spectral domain significantly improved the segmental quality of synthetic speech compared with the conventional methods.
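As a rough illustration of how such a cascade could be applied at synthesis time, the following Python sketch propagates a synthetic spectrum up a DBN stack, across a BAM, and down a second DBN stack to estimate the natural spectrum. All names, layer structures, and the mean-field forward pass are our assumptions for illustration; the paper's trained parameters and inference details are not reproduced here.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class CascadePostfilter:
        """Hypothetical sketch: DBN (synthetic side) -> BAM -> DBN (natural side).

        syn_layers / nat_layers are lists of (W, b) pairs assumed to come from
        layer-wise generative pre-training; bam_W ties the two top hidden layers.
        """

        def __init__(self, syn_layers, bam_W, nat_layers):
            self.syn_layers = syn_layers
            self.bam_W = bam_W
            self.nat_layers = nat_layers

        def enhance(self, syn_spectrum):
            # Bottom-up mean-field pass through the synthetic-speech DBN.
            h = syn_spectrum
            for W, b in self.syn_layers:
                h = sigmoid(h @ W + b)
            # Cross to the natural-speech hidden representation via the BAM.
            h = sigmoid(h @ self.bam_W)
            # Top-down pass through the natural-speech DBN; the visible layer
            # is kept linear here because spectra are real-valued.
            for i, (W, b) in enumerate(self.nat_layers):
                pre = h @ W + b
                h = pre if i == len(self.nat_layers) - 1 else sigmoid(pre)
            return h  # estimated natural-speech spectrum for this frame

A mel-cepstral-domain variant would have the same structure but operate on lower-dimensional cepstral vectors, matching the two postfilter types described in the abstract.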
Highlights
Statistical parametric speech synthesis is one of the most popular methods of speech synthesis due to its flexibility and compact footprint [2]
We found that the proposed deep neural network (DNN)-based postfilter in the spectral domain produced synthetic speech of higher quality than that obtained with any of the conventional postfilters
We proposed a data-driven postfilter technique to improve the segmental quality of statistical parametric text-to-speech synthesis
Summary
Statistical parametric speech synthesis is one of the most popular methods of speech synthesis due to its flexibility and compact footprint [2]. Statistical parametric speech synthesizers have repeatedly been found to be as intelligible as natural human speech at the annual evaluation events of corpus-based speech synthesis systems called the "Blizzard Challenge" [3]. However, it is known that synthesised speech generated from statistical models still sounds "muffled" compared to natural speech; this is often attributed to the loss of fine spectral structures in the generated parameters. Deep neural networks (DNNs) with many hidden layers have been actively investigated to improve the quality of synthetic speech, and several significant improvements have been reported. Recurrent neural networks (RNNs) with long short-term memory (LSTM) have been used for prosody modelling [10] and acoustic trajectory modelling [11], [12]