Unit Selection Speech Synthesis Research Articles

This article presents a method of learning and modeling unit embeddings using deep neural networks (DNNs) for unit-selection-based Mandarin speech synthesis. Here, a unit embedding is defined as a fixed-length embedding vector for a phone-sized unit candidate in a corpus. Modeling phone-sized embedding vectors instead of frame-sized acoustic features can better capture the long-term dependencies among consecutive units in an utterance. First, a DNN with an embedding layer is built to learn the embedding vectors of all unit candidates in the corpus from scratch. To enable the extracted embedding vectors to carry both acoustic and linguistic information about the unit candidates, a multitarget learning strategy is designed for the DNN. Its optional prediction targets include frame-level acoustic features, unit durations, monophone and tone identifiers, and context classes. Then, two further DNNs are constructed to map linguistic features to the extracted embedding vectors. One of them takes the unit vectors of the preceding phones as input, in addition to the linguistic features of the current phone. At synthesis time, the distances between the unit vectors predicted by these two DNNs and those derived from the unit candidates are used as part of the target cost and part of the concatenation cost, respectively. Experiments on a Mandarin speech synthesis corpus demonstrate that learning and modeling unit embeddings improves the naturalness of hidden Markov model (HMM)-based unit selection speech synthesis. Furthermore, according to the subjective evaluation results, integrating multiple targets for learning unit embeddings achieves better performance than using only acoustic targets.
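
The way embedding distances enter the two costs can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the Euclidean distance, the cost weights, and the `predict_from_prev` callable (a stand-in for the DNN that predicts the current unit vector from the preceding unit's vector) are all assumptions made for illustration, and the real system combines these terms with HMM-based costs that are omitted here.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between rows of a (K1, D) and rows of b (K2, D)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def select_units(target_pred, cand_embs, predict_from_prev,
                 w_target=1.0, w_concat=1.0):
    """Viterbi search over candidate units using embedding distances.

    target_pred: (T, D) unit vectors predicted from linguistic features only.
    cand_embs: list of T arrays, each (K_t, D), embeddings of corpus candidates.
    predict_from_prev: hypothetical callable (prev_embs (K, D), t) -> (K, D),
        standing in for the DNN that also sees the preceding unit vector.
    Returns the chosen candidate index per position and the total path cost.
    """
    T = len(cand_embs)
    # Target cost term: distance between each candidate and the prediction.
    target_cost = [w_target * np.linalg.norm(cand_embs[t] - target_pred[t], axis=1)
                   for t in range(T)]
    best = target_cost[0].copy()
    backpointers = []
    for t in range(1, T):
        # Concatenation cost term: distance between the vector predicted from
        # each previous candidate and each current candidate's embedding.
        pred = predict_from_prev(cand_embs[t - 1], t)
        concat_cost = w_concat * pairwise_dist(pred, cand_embs[t])
        total = best[:, None] + concat_cost          # (K_{t-1}, K_t)
        backpointers.append(np.argmin(total, axis=0))
        best = total.min(axis=0) + target_cost[t]
    # Trace back the lowest-cost path.
    path = [int(np.argmin(best))]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return path, float(best.min())

# Toy data (made up): 3 phones, 2 candidates each, 2-dimensional embeddings.
target_pred = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
cand_embs = [np.array([[0.0, 0.0], [5.0, 5.0]]),
             np.array([[9.0, 9.0], [1.0, 0.0]]),
             np.array([[2.0, 0.0], [7.0, 7.0]])]
shift = np.array([1.0, 0.0])
path, cost = select_units(target_pred, cand_embs, lambda prev, t: prev + shift)
print(path, cost)  # [0, 1, 0] 0.0
```

In the toy data, one candidate per position matches both predictions exactly, so the search recovers that sequence at zero cost. With real embeddings the two weights would need tuning against the other cost terms.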

We have applied two state-of-the-art speech synthesis techniques (unit selection and HMM-based synthesis) to the synthesis of emotional speech. A series of carefully designed perceptual tests was used to evaluate speech quality, emotion identification rates and emotional strength for the six emotions we recorded: happiness, sadness, anger, surprise, fear and disgust. For the HMM-based method, we evaluated spectral and source components separately and identified which components contribute to which emotion. Our analysis shows that, although the HMM method produces significantly better neutral speech, the two methods produce emotional speech of similar quality, except for emotions having context-dependent prosodic patterns. Whilst synthetic speech produced using the unit selection method has better emotional strength scores than the HMM-based method, the HMM-based method has the ability to manipulate the emotional strength. For emotions that are characterized by both spectral and prosodic components, synthetic speech using unit selection methods was more accurately identified by listeners. For emotions mainly characterized by prosodic components, HMM-based synthetic speech was more accurately identified. This finding differs from previous results regarding listener judgements of speaker similarity for neutral speech. We conclude that unit selection methods require improvements to prosodic modeling and that HMM-based methods require improvements to spectral modeling for emotional speech. Certain emotions cannot be reproduced well by either method.

Articles published on Unit Selection Speech Synthesis

Automatic statistical evaluation of quality of unit selection speech synthesis with different prosody manipulations

Learning and Modeling Unit Embeddings Using Deep Neural Networks for Unit-Selection-Based Mandarin Speech Synthesis

Unit Selection Speech Synthesis Using Frame-Sized Speech Segments and Neural Network Based Acoustic Models

Unit-Selection Speech Synthesis Method Using Words as Search Units

HMM-based unit selection speech synthesis using log likelihood ratios derived from perceptual data

A Text to Speech System for Fon Language Using Multisyn Algorithm

Admissible Stopping in Viterbi Beam Search for Unit Selection Speech Synthesis

Optimization of Cost Function Weights for Unit Selection Speech Synthesis Using Speech Recognition

Multiple f0 contour parallel Viterbi search for unit selection speech synthesis

Efficient and reliable perceptual weight tuning for unit-selection text-to-speech synthesis based on active interactive genetic algorithms: A proof-of-concept

One-Class Classification for Spectral Join Cost Calculation in Unit Selection Speech Synthesis

The Bonn Open Synthesis System 3

Polish unit selection speech synthesis with BOSS: extensions and speech corpora

Implementation of Polish speech synthesis for the BOSS system

Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech

Trainable unit selection speech synthesis under statistical framework

Multisyn: Open-domain unit selection for the Festival speech synthesis system

Subjective evaluation of join cost and smoothing methods for unit selection speech synthesis

Optimal Utterance Selection for Unit Selection Speech Synthesis Databases
