Abstract

This article focuses on developing a system for high-quality synthesized and converted speech by addressing three fundamental problems. First, because the noise-like component in state-of-the-art parametric vocoders (for example, STRAIGHT) is often not modeled accurately enough, a novel analytical approach is proposed for modeling unvoiced excitation with a temporal envelope; Discrete All-Pole, Frequency Domain Linear Prediction, Low Pass Filter, and True envelopes are studied and applied to the noise excitation signal of our continuous vocoder. Second, we build deep-learning-based text-to-speech (TTS) acoustic models that convert written text into human-like speech, using a feed-forward network and several sequence-to-sequence models (long short-term memory, gated recurrent unit, and a hybrid model). Third, a new voice conversion system is proposed that uses a continuous fundamental frequency to provide accurately time-aligned voiced segments. The results were evaluated with objective measures and subjective listening tests. Experiments show that the proposed models achieve higher speaker similarity and better quality than conventional methods.
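
To make the envelope-shaped noise excitation concrete, the sketch below estimates a temporal envelope by low-pass filtering the Hilbert magnitude envelope of a speech frame and uses it to modulate white noise, in the spirit of the LPF envelope studied here. The cutoff frequency, filter order, and frame handling are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def lpf_temporal_envelope(frame, fs, cutoff_hz=50.0, order=4):
    """Estimate a smooth temporal envelope of one analysis frame.

    The Hilbert transform gives the instantaneous amplitude; a
    zero-phase Butterworth low-pass filter smooths it (the 'LPF
    envelope'). cutoff_hz is an illustrative choice, not the
    paper's exact setting.
    """
    amplitude = np.abs(hilbert(frame))
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")
    return filtfilt(b, a, amplitude)

def shaped_noise_excitation(frame, fs, rng=None):
    """Modulate white noise with the frame's temporal envelope, so the
    unvoiced excitation follows the original energy contour."""
    rng = np.random.default_rng() if rng is None else rng
    env = lpf_temporal_envelope(frame, fs)
    return env * rng.standard_normal(len(frame))

# Toy usage: a decaying noise burst, as in a fricative-like segment.
fs = 16000
t = np.arange(0, 0.02, 1.0 / fs)
frame = np.exp(-60 * t) * np.random.default_rng(0).standard_normal(len(t))
excitation = shaped_noise_excitation(frame, fs)
```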

Highlights

  • Speech synthesis can be defined as the ability of a machine, such as a computer, to produce human-like speech

  • Objective measurements using Phase Distortion Deviation (PDD) show that the True and low-pass filtering (LPF) envelopes give better results than the other envelopes when applied in the continuous vocoder (a simplified sketch of the PDD idea follows this list)

  • We build a deep-learning-based TTS model to increase the quality of synthesized speech and to train all continuous parameters used by the vocoder
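
The bullets above rely on PDD as the objective measure. The following is a minimal sketch of the idea behind it: measure how much the short-time phase spectrum deviates from frame to frame with a circular standard deviation, so noise-like signals score high and harmonic signals score low. The STFT settings and the per-bin phase-difference formulation are simplifying assumptions, not the exact published PDD definition.

```python
import numpy as np

def circular_std(phases, axis=0):
    """Circular standard deviation: sqrt(-2 ln |mean(exp(j*theta))|)."""
    r = np.abs(np.mean(np.exp(1j * phases), axis=axis))
    return np.sqrt(np.maximum(-2.0 * np.log(np.maximum(r, 1e-12)), 0.0))

def pdd_like_measure(x, frame_len=512, hop=128):
    """Frame-to-frame deviation of the short-time phase spectrum.

    Higher values indicate noisier (less deterministic) phase, which
    is how PDD separates noise-like from harmonic components.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    phases = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = window * x[i * hop:i * hop + frame_len]
        phases[i] = np.angle(np.fft.rfft(frame))
    # Phase increment between consecutive frames, per frequency bin.
    dphi = np.diff(phases, axis=0)
    # Deviation across frames, averaged over bins for a single score.
    return float(np.mean(circular_std(dphi, axis=0)))

# White noise should score clearly higher than a pure tone.
rng = np.random.default_rng(0)
print(pdd_like_measure(rng.standard_normal(16000)))
print(pdd_like_measure(np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)))
```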


Summary

Introduction

Speech synthesis can be defined as the ability of a machine, such as a computer, to produce human-like speech. The first goal of this paper is to model the unvoiced excitation more accurately with temporal envelopes, as outlined above. The second goal is to build a deep-learning-based acoustic model for speech synthesis using feed-forward and recurrent neural networks as an alternative to HMMs. Here, the objective is two-fold: (a) to overcome the limitation of HMMs, which typically produce over-smoothed and muffled synthesized speech, and (b) to ensure that all continuous parameters used by the proposed vocoder are obtained during training, so that it can synthesize very high-quality speech. Third, modeling a discontinuous fundamental frequency in voice conversion is problematic because the voiced and unvoiced regions of the source speaker's speech are typically not properly aligned with those of the target speaker. To overcome these limitations, a new model is developed to achieve more natural converted speech.
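
As a concrete illustration of the feed-forward alternative to HMMs, the sketch below maps frame-level linguistic features to continuous vocoder parameters with a small dense network in Keras. The feature dimensions, layer sizes, and output naming (continuous F0 plus spectral coefficients) are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np
from tensorflow import keras

# Illustrative dimensions: per-frame linguistic features in,
# continuous vocoder parameters out (e.g., continuous F0 plus
# spectral coefficients). Not the paper's exact configuration.
N_LINGUISTIC = 425
N_VOCODER = 26

model = keras.Sequential([
    keras.layers.Input(shape=(N_LINGUISTIC,)),
    keras.layers.Dense(1024, activation="tanh"),
    keras.layers.Dense(1024, activation="tanh"),
    keras.layers.Dense(1024, activation="tanh"),
    keras.layers.Dense(N_VOCODER, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")

# Dummy data standing in for aligned (linguistic, acoustic) frames.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, N_LINGUISTIC)).astype("float32")
Y = rng.standard_normal((1000, N_VOCODER)).astype("float32")
model.fit(X, Y, batch_size=128, epochs=2, verbose=0)

# At synthesis time the predicted parameter trajectories would drive
# the continuous vocoder; a recurrent variant would swap the Dense
# stack for LSTM/GRU layers operating on frame sequences.
Y_hat = model.predict(X[:10], verbose=0)
```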

Vocoder description
Noise modeling
True envelope
Discrete all-pole
Frequency domain linear prediction
Low pass filtering
Acoustic modeling within TTS
Deep feed-forward neural network
Recurrent neural network
Speech conversion model
Datasets
Training topology
Phase distortion deviation
RMS log spectral distance
Comparison of the WORLD and continuous vocoders
Comparison of the deep learning architectures using empirical measures
Objective metrics for VC
Subjective evaluations
Findings
Conclusion
