Abstract

This paper describes a system for the automatic evaluation of synthetic speech quality based on a Gaussian mixture model (GMM) classifier. Speech material from a real speaker is compared with synthesized material to determine the similarities and differences between them. The final evaluation order is determined by the distances in the Pleasure-Arousal (P-A) space between the original speech and the synthetic speech produced by different synthesis and/or prosody manipulation methods implemented in the Czech text-to-speech system. The GMMs for continual 2D detection of P-A classes are trained on sound/speech material from databases that have no relation to the original speech or the synthesized sentences. Preliminary and auxiliary analyses show that the P-A classification process and the final evaluation result are substantially influenced by the number of mixtures, the number and type of speech features used, the size of the processed speech material, and the type of database used to create the GMMs. The main evaluation experiments confirm the functionality of the developed system. The objective evaluation results broadly correlate with the subjective ratings of human evaluators; however, some differences were observed, which call for a more detailed follow-up investigation.
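The ranking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state which distance metric is used, so Euclidean distance is an assumption here, and the names `pa_distance`, `rank_by_pa_distance`, `synth_A`, and `synth_B` as well as all coordinates are hypothetical.

```python
import math


def pa_distance(a, b):
    """Euclidean distance between two (pleasure, arousal) points (assumed metric)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def rank_by_pa_distance(original, synthetic):
    """Return method names sorted from closest to farthest from the original speech."""
    return sorted(synthetic, key=lambda name: pa_distance(original, synthetic[name]))


# Hypothetical P-A locations, for illustration only (not results from the paper).
original_pa = (0.20, 0.10)
synthetic_pa = {
    "synth_A": (0.25, 0.15),
    "synth_B": (0.60, -0.30),
}

# The method whose P-A location lies nearest to the original is ranked first.
ranking = rank_by_pa_distance(original_pa, synthetic_pa)
```

In this sketch a smaller distance is read as greater similarity to the original speech, which matches the abstract's description of ordering by P-A distance.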

Highlights

  • At present, many different subjective and objective methods and criteria for quality evaluation of synthetic speech produced by text-to-speech (TTS) systems are used

  • Our current research focuses on the development of an automatic system for the quality evaluation of synthetic speech in the Czech language using different synthesis methods

  • To evaluate synthetic speech quality by continual classification in the P-A scale, we collected the first speech corpus (SC1) consisting of three parts: the original speech uttered by real speakers, and two variations of speech synthesis produced by the Czech TTS system using the unit selection (USEL) method [16] with voices based on the original speaker

Summary

Introduction

Many different subjective and objective methods and criteria are used to evaluate the quality of synthetic speech produced by text-to-speech (TTS) systems. Conventional listening tests usually involve a comparison category rating on a scale from “much better” to “much worse” relative to high-quality reference speech [1]. Subjective scales for rating synthesized speech may include only a few scored parameters, such as an overall impression expressed as a mean opinion score (MOS) describing the perceived speech quality from poor to excellent, a valence from negative to positive, and an arousal from unexcited to excited [4]. For objective estimation of the speech quality of a TTS voice, various speech features extracted from the natural and synthetic speech are evaluated. The synthetic speech quality may be predicted from a mix of several prosodic properties (slope of F0, F0 range, jitter, shimmer, vocalic durations, intervocalic durations) and articulation-associated properties (discrete-cosine-transform coefficients of the mel-cepstrum, together with their delta and delta-delta values) [2].
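As a rough illustration of three of the prosodic properties listed above, the sketch below computes F0 range, local jitter, and local shimmer from per-cycle pitch periods and peak amplitudes. It assumes the common "local" definitions (mean absolute difference between consecutive cycles, divided by the mean); the cited work may use other variants. The function names and the input values are hypothetical.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean.

    Applied to pitch periods this gives local jitter; applied to
    per-cycle peak amplitudes it gives local shimmer.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def f0_range(f0_hz):
    """Difference between the highest and lowest F0 value, in Hz."""
    return max(f0_hz) - min(f0_hz)


# Hypothetical per-cycle measurements, for illustration only.
periods_s = [0.0100, 0.0102, 0.0099, 0.0101]   # pitch periods in seconds
amplitudes = [0.80, 0.82, 0.79, 0.81]          # per-cycle peak amplitudes

jitter = local_perturbation(periods_s)
shimmer = local_perturbation(amplitudes)
f0_span = f0_range([1.0 / t for t in periods_s])
```

Both perturbation measures are dimensionless ratios, so they can be compared between natural and synthetic recordings regardless of the speaker's average pitch or level.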
