Hierarchical Transfer Learning for Multilingual, Multi-Speaker, and Style Transfer DNN-Based TTS on Low-Resource Languages

Kurniawati Azizah, Wisnu Jatmiko, Mirna Adriani

doi:10.1109/access.2020.3027619

Open Access

https://doi.org/10.1109/access.2020.3027619

Abstract

This work applies hierarchical transfer learning to implement deep neural network (DNN)-based multilingual text-to-speech (TTS) for low-resource languages. DNN-based systems typically require a large amount of training data. In recent years, DNN-based TTS has achieved remarkable results for high-resource languages, but it still suffers from data scarcity for low-resource languages. In this article, we propose a multi-stage transfer learning strategy to train our TTS model for low-resource languages. We make use of a high-resource language and a joint multilingual dataset of low-resource languages. A monolingual TTS pre-trained on the high-resource language is fine-tuned on the low-resource language using the same model architecture. Then, we apply partial network-based transfer learning from the pre-trained monolingual TTS to a multilingual TTS, and finally from the pre-trained multilingual TTS to a multilingual TTS with style transfer. Our experiments on Indonesian, Javanese, and Sundanese show adequate quality of synthesized speech. In the evaluation, our multilingual TTS reaches a mean opinion score (MOS) of 4.35 for Indonesian (ground truth = 4.36), 4.20 for Javanese (ground truth = 4.38), and 4.28 for Sundanese (ground truth = 4.20). For parallel style transfer evaluation, our TTS model reaches an F0 frame error (FFE) of 9.08%, 10.13%, and 8.43% for Indonesian, Javanese, and Sundanese, respectively. The results indicate that the proposed strategy can be effectively applied to the low-resource language target domain. With a small amount of training data, our models are able to learn step by step from a smaller TTS network to larger networks, produce intelligible speech approaching the real human voice, and successfully transfer speaking style from a reference audio.
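To make the FFE figures above more concrete, the following is a minimal sketch of how an F0 frame error is commonly computed: the fraction of frames that have either a voicing decision error or a gross pitch error. It is not the authors' evaluation code. The function name `f0_frame_error`, the 20% pitch tolerance, the convention that an F0 of 0 marks an unvoiced frame, and the assumption that both contours are already frame-aligned are illustrative choices that the abstract does not specify.

```python
from typing import Sequence


def f0_frame_error(
    ref_f0: Sequence[float],
    syn_f0: Sequence[float],
    tolerance: float = 0.2,  # assumed gross-pitch-error threshold (20%)
) -> float:
    """Return the fraction of frames with a voicing or gross pitch error.

    Assumes both contours are frame-aligned, have equal length, and use
    0.0 to mark unvoiced frames (hypothetical conventions for this sketch).
    """
    if len(ref_f0) != len(syn_f0):
        raise ValueError("F0 contours must have the same number of frames")
    if len(ref_f0) == 0:
        raise ValueError("F0 contours must not be empty")

    error_frames = 0
    for ref, syn in zip(ref_f0, syn_f0):
        ref_voiced = ref > 0
        syn_voiced = syn > 0
        if ref_voiced != syn_voiced:
            # Voicing decision error: one contour is voiced, the other is not.
            error_frames += 1
        elif ref_voiced and abs(syn - ref) > tolerance * ref:
            # Gross pitch error: both voiced, but F0 deviates beyond tolerance.
            error_frames += 1
    return error_frames / len(ref_f0)


# Toy contours in Hz (made-up values, not data from the paper).
reference = [0.0, 100.0, 200.0, 0.0, 150.0]
synthesized = [0.0, 110.0, 260.0, 120.0, 0.0]
ffe = f0_frame_error(reference, synthesized)
print(f"FFE = {ffe:.2%}")
```

In the toy example, frame 3 is a gross pitch error and frames 4 and 5 are voicing decision errors, so three of the five frames count as errors. A lower FFE means the synthesized pitch contour follows the reference more closely.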

Highlights

Speech is the most natural verbal communication tool that can be understood by normal humans [1]
The paper compares alignment learning under two training schemes, training from scratch and hierarchical transfer learning, for all models
It also evaluates the speech synthesized by the TTS models trained using the transfer learning scheme
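As a rough illustration of the partial network-based transfer used in the hierarchical scheme, the sketch below copies only the parameters that a smaller pre-trained model and a larger target model have in common, and leaves the remaining target parameters at their initial values. It uses plain dictionaries rather than a deep learning framework. The parameter names, sizes, and the helper `transfer_matching_params` are hypothetical and do not come from the paper.

```python
import random
from typing import Dict, List

Params = Dict[str, List[float]]


def transfer_matching_params(pretrained: Params, target: Params) -> List[str]:
    """Copy parameters whose name and size match; return the copied names.

    Parameters that exist only in the target (for example, new embedding
    tables) are left untouched, so they keep their initial values.
    """
    transferred = []
    for name, values in pretrained.items():
        if name in target and len(target[name]) == len(values):
            target[name] = list(values)  # copy so the models do not share state
            transferred.append(name)
    return transferred


random.seed(0)

# Hypothetical pre-trained monolingual model parameters.
monolingual = {
    "encoder.weight": [0.1, 0.2, 0.3],
    "decoder.weight": [0.4, 0.5],
}

# Hypothetical larger multilingual model with randomly initialized parameters.
multilingual = {
    "encoder.weight": [random.gauss(0, 1) for _ in range(3)],
    "decoder.weight": [random.gauss(0, 1) for _ in range(2)],
    "speaker_embedding.weight": [random.gauss(0, 1) for _ in range(4)],
}

copied = transfer_matching_params(monolingual, multilingual)
print("Transferred:", copied)
```

After the call, the shared encoder and decoder parameters start from the pre-trained values, while the hypothetical speaker embedding must still be learned from the target data. The same step could be repeated to move from the multilingual model to one with additional style-transfer components.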

Summary

Introduction

Speech is the most natural verbal communication tool that can be understood by normal humans [1]. The purpose of building TTS is to produce synthesized speech that can be understood and is indistinguishable from sound produced by real humans [3]. Model-based TTS research was dominated by statistical parametric speech synthesis (SPSS) [5]–[9] until recent years, when deep learning delivered extraordinary achievements in various fields [10]–[12].

Methods

Results

Conclusion