Abstract

Statistical parametric speech synthesis based on Hidden Markov Models (HMMs) has been an important technique for producing artificial voices, thanks to its ability to deliver highly intelligible results and sophisticated features such as voice conversion and accent modification with a small footprint, particularly for low-resource languages where deep learning-based techniques remain unexplored. Despite this progress, the quality of HMM-based results does not reach that of the predominant approaches, based on unit selection of speech segments or on deep learning. One proposal for improving the quality of HMM-based speech has been to incorporate postfiltering stages, which aim to increase quality while preserving the advantages of the process. In this paper, we present a new approach to postfiltering synthesized voices through discriminative postfilters built from several long short-term memory (LSTM) deep neural networks. Our motivation is to model a specific mapping from synthesized to natural speech on the segments corresponding to voiced or unvoiced sounds, because these sounds have different qualities and HMM-based voices can exhibit distinct degradation in each. The paper analyses the discriminative postfilters obtained for five voices, evaluated with three objective measures, including Mel cepstral distance, and with subjective tests. The results indicate the advantages of the discriminative postfilters over the HTS voices and the non-discriminative postfilters.
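The core idea can be illustrated with a minimal sketch, not the authors' implementation: assuming PyTorch, 40-dimensional Mel-cepstral frames, and voicing decided by a nonzero F0 value (none of these details are specified in the abstract), each synthesized frame is routed to one of two LSTM postfilters, one trained on voiced material and one on unvoiced material.

```python
# Minimal, hypothetical sketch of a discriminative LSTM postfilter.
# Feature dimension (40), hidden size, and F0-based voicing routing are
# illustrative assumptions, not details taken from the paper.
import torch
import torch.nn as nn


class LSTMPostfilter(nn.Module):
    """Maps synthesized Mel-cepstral frames toward natural-speech frames."""

    def __init__(self, feat_dim: int = 40, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, x):  # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)
        return self.proj(h)


def discriminative_postfilter(feats, f0, voiced_pf, unvoiced_pf):
    """Route each frame to the voiced or unvoiced postfilter by its F0 flag.

    feats: (frames, feat_dim) synthesized Mel-cepstral coefficients
    f0:    (frames,) fundamental frequency, 0 for unvoiced frames
    """
    enhanced = feats.clone()
    voiced_mask = f0 > 0
    if voiced_mask.any():
        enhanced[voiced_mask] = voiced_pf(feats[voiced_mask].unsqueeze(0)).squeeze(0)
    if (~voiced_mask).any():
        enhanced[~voiced_mask] = unvoiced_pf(feats[~voiced_mask].unsqueeze(0)).squeeze(0)
    return enhanced


if __name__ == "__main__":
    torch.manual_seed(0)
    voiced_pf, unvoiced_pf = LSTMPostfilter(), LSTMPostfilter()
    feats = torch.randn(200, 40)      # 200 synthesized frames (toy data)
    f0 = torch.rand(200) * 200        # toy F0 track in Hz
    f0[::3] = 0.0                     # mark every third frame as unvoiced
    with torch.no_grad():
        out = discriminative_postfilter(feats, f0, voiced_pf, unvoiced_pf)
    print(out.shape)                  # torch.Size([200, 40])
```

In practice the two networks would be trained on aligned synthesized/natural feature pairs, and contiguous voiced or unvoiced segments would be presented as whole sequences so the LSTMs can exploit temporal context; the frame-wise routing above only illustrates the discriminative split.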

Highlights

  • In the field of speech synthesis, pursuing the creation of artificial voices with natural sound and flexibility, statistical parametric speech synthesis has been a hot topic for researchers for more than a decade [1,2]

  • We present, for the first time, discriminative postfiltering that enhances synthesized speech with a group of deep neural networks trained to map voiced and unvoiced sounds separately

  • Speech enhanced by the discriminative postfilters is significantly preferred over the best HTS voices and the non-discriminative postfilters for all voices, with the most notable differences in the RMS and BDL voices


Summary

Introduction

In the field of speech synthesis, pursuing the creation of artificial voices with natural sound and flexibility, statistical parametric speech synthesis has been a hot topic for researchers for more than a decade [1,2]. For under-resourced languages, or for the first development of artificial speech in a language, HMM-based speech synthesis is a commonly applied technique [5,6,7,8]. Despite the advantages of this technique, shortcomings concerning naturalness and overall quality have been reported in many implementations in languages around the world, often described as a buzzy and muffled sound [9]. The three principal factors that affect the quality of statistical parametric speech synthesis are limitations of the parametric synthesizer itself, the inadequacy of acoustic modeling, and the over-smoothing effect of parameter generation [2].
