Abstract

The most recent end-to-end speech synthesis systems use phonemes as acoustic input tokens and ignore the information about which word the phonemes come from. However, many words have their own characteristic prosody, which can significantly affect naturalness. Prior works have employed pre-trained linguistic word embeddings as TTS system input. However, since linguistic information is not directly related to how words are pronounced, the TTS quality improvement of these systems is mild. In this paper, we propose a novel and effective way of jointly training acoustic phone and word embeddings for end-to-end TTS systems. Experiments on the LJSpeech dataset show that the acoustic word embeddings dramatically decrease both the training and validation loss in phone-level prosody prediction. Subjective evaluations of naturalness demonstrate that incorporating acoustic word embeddings significantly outperforms both the pure phone-based system and a TTS system with pre-trained linguistic word embeddings.

Highlights

  • End-to-end text-to-speech (TTS) synthesis models with sequence-to-sequence architectures [1,2,3] have achieved great success in generating natural-sounding speech. To avoid autoregressive frame-by-frame generation, non-autoregressive TTS models, such as FastSpeech [4] and FastSpeech2 [5], have been proposed for fast generation speed

  • We propose using acoustic word embeddings for natural speech synthesis

  • In order to make sure that each acoustic word embedding is well trained, we only considered high-frequency words in the training set
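The high-frequency filtering mentioned in the last highlight can be sketched as a vocabulary builder that keeps only words whose training-set count reaches a threshold and maps all rarer words to a shared fallback token. The threshold value, the `<unk>` token, and the function names here are illustrative assumptions, not details from the paper:

```python
from collections import Counter

def build_word_vocab(corpus_words, min_count=5, unk_token="<unk>"):
    """Keep only words seen at least `min_count` times in the training set;
    all rarer words share a single <unk> embedding slot (assumed handling)."""
    counts = Counter(corpus_words)
    vocab = {unk_token: 0}
    for word in sorted(counts):
        if counts[word] >= min_count:
            vocab[word] = len(vocab)
    return vocab

def words_to_ids(words, vocab, unk_token="<unk>"):
    """Map each word to its embedding index, falling back to <unk>."""
    return [vocab.get(w, vocab[unk_token]) for w in words]
```

With `min_count=5`, a word appearing once in the corpus would not receive its own acoustic embedding and would instead share the fallback slot, which keeps every retained embedding well trained.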


Summary

Introduction

End-to-end text-to-speech (TTS) synthesis models with sequence-to-sequence architectures [1,2,3] have achieved great success in generating natural-sounding speech. Most of these end-to-end TTS systems use only phonemes as input tokens and ignore the information about which word the phonemes come from. Similar to the phoneme embeddings that encode how the phonemes are pronounced, our acoustic word embeddings directly indicate how the words are pronounced. Both the phoneme and word sequences are utilized as input to the TTS system and passed through two separate encoders. We carry out subjective evaluations of naturalness, which demonstrate that our proposed system is better than the system that uses only the phoneme sequence as input, and better than prior works that add linguistic word embeddings from pre-trained GloVe or BERT.
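The dual-input design described above can be sketched as two jointly trainable lookup tables whose outputs are combined at the phone level: each phone embedding is concatenated with the embedding of the word it belongs to, repeated over that word's phones. The embedding sizes, inventory sizes, and the use of plain lookup tables in place of the paper's two encoders are assumptions for illustration only:

```python
import random

random.seed(0)
PHONE_DIM, WORD_DIM = 8, 4      # assumed embedding sizes
N_PHONES, N_WORDS = 50, 100     # assumed phone inventory / word vocabulary sizes

# Random initialization stands in for jointly trained embedding tables.
phone_table = [[random.gauss(0, 1) for _ in range(PHONE_DIM)] for _ in range(N_PHONES)]
word_table = [[random.gauss(0, 1) for _ in range(WORD_DIM)] for _ in range(N_WORDS)]

def encode_utterance(phone_ids, word_ids, phones_per_word):
    """Look up each phone's embedding and concatenate it with the embedding
    of the word that phone belongs to (word embedding repeated per phone)."""
    out, i = [], 0
    for word_id, n in zip(word_ids, phones_per_word):
        for _ in range(n):
            out.append(phone_table[phone_ids[i]] + word_table[word_id])
            i += 1
    return out  # one (PHONE_DIM + WORD_DIM)-vector per phone
```

In a full system these per-phone vectors would feed the rest of the acoustic model; the point of the sketch is only that the word-level signal reaches every phone position.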

Related Work
FastSpeech2
Objective Evaluation of Phone-Level Prosody Prediction
Acoustic Word Embeddings
Text Normalization
Model Architecture
Experimental Setup
Word Encoder Architectures
Word Frequency Threshold
Naturalness
Findings
Conclusions