Statistical Parametric Speech Synthesis System Research Articles

Total laryngectomy, i.e., the surgical removal of the larynx, has a profound influence on a patient’s quality of life. The procedure results in the loss of natural voice, which constitutes a significant socio-psychological problem for the patient. The main aim of the study was to develop a statistical parametric speech synthesis system for a patient with laryngeal cancer on the basis of the patient’s speech samples recorded shortly before surgery, and to check whether it was possible to generate speech of a quality close to that of the original recordings. The recordings used a representative corpus of the Polish language consisting of 2150 sentences. The recorded voice showed signs of dysphonia, which was confirmed by the auditory-perceptual RBH scale (roughness, breathiness, hoarseness) and by acoustic analysis using the Acoustic Voice Quality Index (AVQI). The speech synthesis model was trained using the Merlin repository. Twenty-five experts participated in the MUSHRA listening tests, rating the patient’s synthetic voice at 69.4 on a 0–100 scale relative to the professional voice-over talent recording, which is a very good result. The authors compared the quality of this synthetic voice to that of another synthetic speech model trained on the same corpus, but with speech samples recorded by a voice-over talent. The same experts rated that voice at 63.63, which means that the synthetic voice of the patient with laryngeal cancer obtained a higher score than the synthetic voice trained on the talent’s recordings. As such, the method enabled the creation of a statistical parametric speech synthesizer for patients awaiting total laryngectomy. As a result, the solution would improve the patient’s quality of life and mental wellbeing.

We present a series of intelligibility experiments performed on natural and synthetic speech time-compressed at a range of rates, and analyze the effect of speech corpus and compression method on the intelligibility scores of sighted and blind individuals. In particular, we are interested in comparing linear and non-linear compression methods applied to normal and fast speech of different speakers. We recorded English and German language voice talents reading prompts at a normal and a fast rate. To create synthetic voices, we trained a statistical parametric speech synthesis system based on the normal and the fast data of each speaker. We compared three compression methods: scaling the variance of the state duration model, interpolating the duration models of the fast and the normal voices, and applying a linear compression method to the generated speech waveform. Word recognition results for the English voices show that generating speech at a normal speaking rate and then applying linear compression resulted in the most intelligible speech at all tested rates. A similar result was found when evaluating the intelligibility of the natural speech corpus. For the German voices, interpolation was found to be better at moderate speaking rates, but the linear method was again more successful at very high rates, particularly when applied to the fast data. Phonemic-level annotation of the normal and fast databases showed that the German speaker was able to reproduce speech at a fast rate with fewer deletion and substitution errors than the English speaker, supporting the intelligibility benefits observed when compressing his fast speech. This shows that the use of fast speech data to create faster synthetic voices does not necessarily lead to more intelligible voices, as results are highly dependent on how successful the speaker was at speaking fast while maintaining intelligibility. Linear compression applied to normal-rate speech can more reliably provide higher intelligibility, particularly at ultra-fast rates.
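The three ways of speeding up synthetic speech named in this abstract can be illustrated at the level of state durations. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function names, the frame-count inputs, and the `alpha` and `rate` parameters are hypothetical. The variance-scaling rule follows the duration control commonly used in HMM-based synthesis (each state duration is its mean plus a shared factor times its variance), and uniform scaling of durations stands in here for linear compression, which the abstract applies to the generated waveform rather than to the duration model.

```python
def variance_scaled_durations(means, variances, target_total):
    """Non-linear compression: d_i = mean_i + rho * var_i.

    rho is shared by all states and chosen so the durations sum to
    target_total; states with larger variance absorb more of the change.
    (A real system would also clamp durations to at least one frame.)
    """
    rho = (target_total - sum(means)) / sum(variances)
    return [m + rho * v for m, v in zip(means, variances)]


def interpolated_durations(normal_means, fast_means, alpha):
    """Interpolate between normal-rate and fast-rate duration models.

    alpha = 0 gives the normal voice, alpha = 1 the fast voice
    (alpha is a hypothetical interpolation weight).
    """
    return [(1 - alpha) * n + alpha * f
            for n, f in zip(normal_means, fast_means)]


def uniformly_scaled_durations(means, rate):
    """Linear compression stand-in: every state shrinks by the same factor."""
    return [m / rate for m in means]


if __name__ == "__main__":
    # Illustrative numbers only (frames per state), not data from the study.
    normal = [10.0, 20.0, 30.0]
    fast = [6.0, 10.0, 20.0]
    variances = [1.0, 4.0, 5.0]

    print(variance_scaled_durations(normal, variances, target_total=30.0))
    print(interpolated_durations(normal, fast, alpha=0.5))
    print(uniformly_scaled_durations(normal, rate=2.0))
```

In this sketch the first two methods change the relative timing of states, whereas uniform scaling preserves it, which is the distinction between non-linear and linear compression that the experiments compare.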

Articles published on Statistical Parametric Speech Synthesis System

Measuring the Quality of Low-Resourced Statistical Parametric Speech Synthesis Trained with Noise-Degraded Data Supported by the University of Costa Rica

Implementing a Statistical Parametric Speech Synthesis System for a Patient with Laryngeal Cancer

A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $F_0$ Model for Statistical Parametric Speech Synthesis

A Review of Deep Learning Based Speech Synthesis

Hidden-Markov-model based statistical parametric speech synthesis for Marathi with optimal number of hidden states

A Study of End-to-End Synthesis Methods for Korean Text-to-Speech (TTS) Systems

Synthesis of Tongue Motion and Acoustics From Text Using a Multimodal Articulatory Database

Time-domain deterministic plus noise model based hybrid source modeling for statistical parametric speech synthesis

Intelligibility of time-compressed synthetic speech: Compression method and speaking style

Enhancing the Intelligibility of Statistically Generated Synthetic Speech by Means of Noise-Independent Modifications

Synthesis and perception of breathy, normal, and Lombard speech in the presence of noise

The Nitech-NAIST HMM-Based Speech Synthesis System for the Blizzard Challenge 2006
