Abstract

Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset. 
Overall, we found that our proposed TTS system consisting of a spectrogram prediction network and a PWG neural vocoder was able to achieve reasonable performance using only 30 min of target language training data. We also found that by using 3 h of target language data, for training the model and for generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model, which was trained with 12 h of target language data.
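For readers unfamiliar with AB preference testing, the evaluation mentioned above amounts to tallying, per listener, which of two systems sounds better. The sketch below uses entirely synthetic responses (not the authors' data) to show how such a tally is computed:

```python
from collections import Counter

# Hypothetical listener responses from an AB preference test: each
# listener hears the same utterance synthesized by system A (vocoder
# trained on augmented data) and system B (vocoder trained on the full
# target-language dataset) and picks the better-sounding one, or
# reports no preference.
responses = ["A", "B", "B", "no_pref", "A", "B", "A", "A", "no_pref", "B"]

counts = Counter(responses)
total = len(responses)
for choice in ("A", "B", "no_pref"):
    share = 100 * counts[choice] / total
    print(f"{choice}: {counts[choice]}/{total} ({share:.0f}%)")
```

A roughly even split between A, B, and "no preference", as in this toy tally, is the pattern behind the paper's conclusion that the two vocoders achieved almost the same perceived quality.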

Highlights

  • Deep learning techniques are widely used in TTS systems due to their ability to generate higher quality synthesized speech than traditional methods

  • We compared the performance of various TTS models and found that multi-speaker models were effective as intermediate models when constructing a single-speaker, low-resource TTS model

  • We trained some models using only transfer learning and some using only data augmentation, to evaluate how each method affected the naturalness of the speech output by the TTS model


Introduction

Byambadorj et al., EURASIP Journal on Audio, Speech, and Music Processing (2021) 2021:42

Deep learning techniques are widely used in TTS systems due to their ability to generate higher quality synthesized speech than traditional methods. Recent end-to-end neural models, such as Tacotron [1], Tacotron 2 [2], Deep Voice 3 [3], and Char2Wav [4], have substantially improved the naturalness of synthesized speech. Researchers have proposed a variety of techniques which can be used for TTS with low-resource languages. These techniques include:

Monolingual transfer learning: When only a small dataset of a particular type of speech is available (such as the speech of an additional speaker, emotional speech, or an alternative speaking style), a model pre-trained on a large amount of a different type of speech data can be adapted to the low-resource task via transfer learning. In studies [6] and [7], all of the datasets used were in the same language.
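As a rough, self-contained illustration of this transfer-learning recipe (not the authors' Tacotron-based setup; the linear model, synthetic data, and hyperparameters below are all placeholders), one can pre-train on a large "high-resource" dataset and then fine-tune on a small "low-resource" dataset, and compare against training on the small dataset alone:

```python
import numpy as np

rng = np.random.default_rng(0)

def train(w, X, y, lr=0.1, epochs=200):
    """Plain gradient-descent training of a linear model y ~ X @ w."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Shared underlying mapping: source and target tasks are related,
# mimicking "same language, different speaker/style" data.
true_w = rng.normal(size=3)

# Large "high-resource" source dataset.
X_src = rng.normal(size=(1000, 3))
y_src = X_src @ true_w + 0.1 * rng.normal(size=1000)

# Tiny "low-resource" target dataset (slightly shifted task).
X_tgt = rng.normal(size=(20, 3))
y_tgt = X_tgt @ (true_w + 0.2) + 0.1 * rng.normal(size=20)

# (1) Train from scratch on the small target set only (few epochs,
# standing in for the limited training budget of a small dataset).
w_scratch = train(np.zeros(3), X_tgt, y_tgt, epochs=10)

# (2) Transfer learning: pre-train on the source data, then fine-tune
# on the target data for the same small number of epochs.
w_pre = train(np.zeros(3), X_src, y_src)
w_ft = train(w_pre, X_tgt, y_tgt, epochs=10)

print("scratch MSE:", mse(w_scratch, X_tgt, y_tgt))
print("fine-tuned MSE:", mse(w_ft, X_tgt, y_tgt))
```

Because the fine-tuned model starts from weights already close to the target task's optimum, it reaches a lower error within the same small training budget, which is the intuition behind using pre-trained models for low-resource TTS.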

