Abstract

This article focuses on developing a system for high-quality synthesized and converted speech by addressing three fundamental principles. First, because the noise-like component in state-of-the-art parametric vocoders (for example, STRAIGHT) is often not modeled accurately enough, a novel analytical approach for modeling unvoiced excitation with a temporal envelope is proposed: Discrete All-Pole, Frequency Domain Linear Prediction, Low Pass Filter, and True envelopes are studied and applied to the noise excitation signal of our continuous vocoder. Second, we build a deep learning-based text-to-speech (TTS) model that converts written text into human-like speech, using a feed-forward network and several sequence-to-sequence models (long short-term memory, gated recurrent unit, and a hybrid model). Third, a new voice conversion system is proposed that uses a continuous fundamental frequency to provide accurately time-aligned voiced segments. The results were evaluated with objective measures and subjective listening tests. Experiments showed that the proposed models achieve higher speaker similarity and better quality than conventional methods.
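
To illustrate the low-pass-filter envelope idea mentioned above, the sketch below estimates a temporal envelope of a noise-like excitation by low-pass filtering the magnitude of its analytic (Hilbert) signal, then reshapes white noise with it. This is a minimal illustration, not the paper's implementation; the function name, cutoff frequency, and filter order are assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def lpf_temporal_envelope(noise, fs, cutoff_hz=50.0, order=4):
    """Estimate a temporal envelope of a noise-like excitation by
    low-pass filtering the magnitude of its analytic (Hilbert) signal.
    cutoff_hz and order are illustrative choices, not the paper's values."""
    magnitude = np.abs(hilbert(noise))              # instantaneous amplitude
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")
    env = filtfilt(b, a, magnitude)                 # zero-phase smoothing
    return np.maximum(env, 0.0)                     # envelopes are non-negative

# Illustration: shape white noise with the estimated envelope.
fs = 16000
noise = np.random.randn(fs)                         # 1 s of white noise
env = lpf_temporal_envelope(noise, fs)
shaped = noise * (env / (np.abs(env).max() + 1e-9))
```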

Highlights

  • Speech synthesis can be defined as the ability of a machine, such as a computer, to produce human speech

  • From objective measurements using Phase Distortion Deviation (PDD), we found that the True and low-pass filtering (LPF) envelopes give better results than the other envelopes when applied in the continuous vocoder (see the PDD sketch after this list)

  • We build a deep learning-based TTS model to increase the quality of the synthesized speech and to train all continuous parameters used by the vocoder
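
Under one common formulation, PDD is the circular standard deviation, across analysis frames, of the phase distortion (the phase difference between adjacent harmonics with the fundamental's phase removed). The sketch below assumes the per-frame harmonic phases have already been extracted by a harmonic analysis step; the function name and this exact formulation are illustrative assumptions, not necessarily the paper's implementation:

```python
import numpy as np

def phase_distortion_deviation(harmonic_phases):
    """Circular standard deviation of phase distortion across frames.

    harmonic_phases: array of shape (n_frames, n_harmonics) holding the
    instantaneous phase of each harmonic per analysis frame (assumed to
    be extracted elsewhere, e.g. by a harmonic model).
    """
    phi = np.asarray(harmonic_phases)
    # Phase distortion: difference between adjacent harmonics, with the
    # fundamental's phase subtracted (one common formulation).
    pd = phi[:, 1:] - phi[:, :-1] - phi[:, :1]
    # Circular standard deviation over frames for each harmonic band.
    r = np.abs(np.mean(np.exp(1j * pd), axis=0))    # resultant length
    return np.sqrt(-2.0 * np.log(np.maximum(r, 1e-12)))
```

Low PDD values indicate a deterministic (voiced-like) signal, while high values indicate noise-like behavior, which is why PDD is a useful check on how well a vocoder renders the unvoiced component.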

Summary

Introduction

Speech synthesis can be defined as the ability of a machine, such as a computer, to produce human speech. The second goal of this paper is to build a deep learning-based acoustic model for speech synthesis using feed-forward and recurrent neural networks as an alternative to HMMs. Here, the objective is two-fold: (a) to overcome the limitation of HMMs, which typically produce over-smoothed, muffled synthesized speech, and (b) to ensure that all continuous parameters used by the proposed vocoder are obtained during training, so that the system can synthesize very high-quality speech. Modeling a discontinuous fundamental frequency in speech conversion is problematic because the voiced and unvoiced regions of the source speaker are typically not appropriately aligned with those of the target speaker. To overcome these limitations, a new model is developed to achieve more natural converted speech.
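
To make the acoustic-modeling idea concrete, here is a minimal PyTorch sketch (not the paper's exact architecture) of the two model families named above: a frame-level feed-forward network and an LSTM sequence model, each mapping linguistic input features to continuous vocoder parameters. All layer sizes and feature dimensions are hypothetical placeholders:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: linguistic input features and continuous
# vocoder parameters (e.g. spectral coefficients + continuous F0 +
# maximum voiced frequency); the paper's exact sizes are not given here.
IN_DIM, OUT_DIM, HIDDEN = 425, 27, 256

class FeedForwardAcousticModel(nn.Module):
    """Frame-level DNN: linguistic features -> vocoder parameters."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IN_DIM, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, OUT_DIM),
        )
    def forward(self, x):                 # x: (batch, frames, IN_DIM)
        return self.net(x)

class RecurrentAcousticModel(nn.Module):
    """LSTM sequence model capturing temporal context across frames."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(IN_DIM, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, OUT_DIM)
    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h)

# Standard regression training step with MSE loss on a dummy batch.
model = RecurrentAcousticModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 100, IN_DIM)           # 8 utterances, 100 frames each
y = torch.randn(8, 100, OUT_DIM)
optimizer.zero_grad()
loss = nn.MSELoss()(model(x), y)
loss.backward()
optimizer.step()
```

Because every output parameter (including F0) is continuous, the regression target has no voiced/unvoiced switch, which is what lets a single MSE-trained network predict all vocoder parameters jointly.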

Vocoder description
Noise modeling
True envelope
Discrete all-pole
Frequency domain linear prediction
Low pass filtering
Acoustic modeling within TTS
Deep feed-forward neural network
Recurrent neural network
Speech conversion model
Datasets
Training topology
Phase distortion deviation
RMS – Log spectral distance
Comparison of the WORLD and continuous vocoders
Comparison of the deep learning architectures using empirical measures
Objective metrics for VC
Subjective evaluations
Findings
Conclusion
