Utterance copy through analysis-by-synthesis using genetic algorithm

Fabíola Araújo, Aldebaro Klautau, José Filho

doi:10.1186/s13173-015-0037-9


Open Access

https://doi.org/10.1186/s13173-015-0037-9


Abstract

Utterance copy consists of estimating the input parameters that allow a speech synthesizer to reconstruct a speech signal. This process is distinct from the more traditional text-to-speech, yet it is used in many areas, especially linguistics and health. Utterance copy is a difficult inverse problem because the mapping is nonlinear and many-to-one. Performing utterance copy manually requires a considerable amount of time, so automatic methods, such as the one proposed here, are of interest. This work presents our system, based on a genetic algorithm (GA), that automatically estimates the input parameters of the Klatt synthesizer using an analysis-by-synthesis process. Results are presented for synthetic (computer-generated) and natural (human-generated) speech, for male and female speakers. These results are compared with those obtained with WinSnoori, the only currently available software that performs the same task. The experiments showed that the proposed newGASpeech system is an effective alternative to the laborious manual process of estimating the input parameters of a Klatt synthesizer. It also outperforms the baseline by a large margin with respect to five objective figures of merit. For example, on average, the mean squared error is reduced to approximately 60.4% and 75.2% when natural target voices from male and female speakers are used, respectively.

Highlights

Utterance copy consists of estimating the input parameters that allow a speech synthesizer to reconstruct a speech signal.
The experiments were performed in two ways: first, to assess the convergence of the genetic algorithm (GA) with respect to the dimensionality of the search space; second, to compare the synthesized male and female voices obtained by newGASpeech and WinSnoori, using natural and synthetic voices as targets.
The target and synthetic signals were aligned according to their cross-correlation, and the results were evaluated using the following metrics: signal-to-noise ratio (SNR), root-mean-square error (RMSE), log-spectral distance (DLE), Perceptual Evaluation of Speech Quality (PESQ), and P.563.
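
To make the alignment and the two simplest metrics concrete, the following is a minimal sketch and not the authors' implementation. It assumes both signals are plain lists of samples at the same sampling rate, searches a hypothetical `max_lag` window for the lag that maximizes the cross-correlation, and then computes RMSE and SNR on the aligned pair. All function names are illustrative.

```python
import math

def best_lag(target, synth, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation."""
    def xcorr(lag):
        return sum(
            target[n] * synth[n - lag]
            for n in range(len(target))
            if 0 <= n - lag < len(synth)
        )
    return max(range(-max_lag, max_lag + 1), key=xcorr)

def align(target, synth, max_lag):
    """Shift synth by the best lag, zero-padding or trimming to len(target)."""
    lag = best_lag(target, synth, max_lag)
    return [
        synth[n - lag] if 0 <= n - lag < len(synth) else 0.0
        for n in range(len(target))
    ]

def rmse(target, synth):
    """Root-mean-square error between two equal-length signals."""
    return math.sqrt(
        sum((t - s) ** 2 for t, s in zip(target, synth)) / len(target)
    )

def snr_db(target, synth):
    """SNR in dB, treating (target - synth) as the noise."""
    signal = sum(t * t for t in target)
    noise = sum((t - s) ** 2 for t, s in zip(target, synth))
    return float("inf") if noise == 0 else 10.0 * math.log10(signal / noise)
```

Aligning before scoring matters because sample-wise metrics such as RMSE and SNR penalize a pure time shift heavily, even when the two waveforms are otherwise identical.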

Summary

Introduction

Utterance copy consists of estimating the input parameters that allow a speech synthesizer to reconstruct a speech signal. Given a target utterance (a sentence, word, or phoneme spoken by the person of interest), the task is to find the set of parameters that, when used as the input of a synthesizer, generates an artificial voice that resembles the target. This can be done manually, by trial and error, or automatically. Because objectively evaluating synthetic voices is difficult, several complementary figures of merit were adopted, namely the log-spectral distance (DLE), signal-to-noise ratio (SNR) [6], root-mean-square error (RMSE), Perceptual Evaluation of Speech Quality (PESQ) [7], and P.563, a single-ended method for objective speech quality assessment [8].
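
The automatic route can be illustrated with a generic analysis-by-synthesis loop driven by a genetic algorithm. This is a sketch under strong assumptions and is not the newGASpeech system: `toy_synth` is a hypothetical stand-in for the Klatt synthesizer (a single damped sinusoid with three parameters), the bounds in `BOUNDS` are invented for the example, and the fitness is a plain time-domain mean squared error. The GA operators (tournament selection, uniform crossover, Gaussian mutation, one elite) are common textbook choices, not necessarily those used in the paper.

```python
import math
import random

SAMPLE_RATE = 8000  # Hz, assumed for this toy example
N_SAMPLES = 160     # one 20 ms frame

# Hypothetical bounds: (frequency in Hz, decay in 1/s, amplitude)
BOUNDS = [(200.0, 1000.0), (10.0, 200.0), (0.1, 1.0)]

def toy_synth(params):
    """Stand-in for a real synthesizer: one damped sinusoid."""
    freq, decay, amp = params
    return [
        amp * math.exp(-decay * n / SAMPLE_RATE)
        * math.sin(2.0 * math.pi * freq * n / SAMPLE_RATE)
        for n in range(N_SAMPLES)
    ]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def random_individual(rng):
    return [rng.uniform(lo, hi) for lo, hi in BOUNDS]

def mutate(ind, rng, rate=0.3, scale=0.1):
    """Gaussian mutation per gene, clipped to the parameter bounds."""
    out = []
    for gene, (lo, hi) in zip(ind, BOUNDS):
        if rng.random() < rate:
            gene += rng.gauss(0.0, scale * (hi - lo))
        out.append(min(hi, max(lo, gene)))
    return out

def crossover(a, b, rng):
    """Uniform crossover: each gene comes from either parent."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def tournament(scored, rng, k=3):
    """Pick the lowest-error individual among k random candidates."""
    return min(rng.sample(scored, k), key=lambda s: s[0])[1]

def utterance_copy(target, pop_size=40, generations=30, seed=0):
    """Search for synthesizer parameters whose output matches target."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []  # best error at each generation
    for _ in range(generations):
        # Analysis-by-synthesis: synthesize each candidate, compare to target.
        scored = sorted(
            ((mse(target, toy_synth(ind)), ind) for ind in population),
            key=lambda s: s[0],
        )
        history.append(scored[0][0])
        children = [scored[0][1]]  # elitism: keep the best unchanged
        while len(children) < pop_size:
            child = crossover(
                tournament(scored, rng), tournament(scored, rng), rng
            )
            children.append(mutate(child, rng))
        population = children
    best = min(population, key=lambda ind: mse(target, toy_synth(ind)))
    return best, history
```

Calling `utterance_copy(toy_synth([500.0, 60.0, 0.8]))` returns the best parameter vector found and the per-generation best error. Because the elite is carried over unchanged, that error never increases from one generation to the next, although nothing guarantees that the search reaches the true parameters.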



Journal: Journal of the Brazilian Computer Society	Publication Date: Oct 8, 2015
Citations: 11	License type: CC BY 4.0
