Deep Learning for Mandarin-Tibetan Cross-Lingual Speech Synthesis

Weizhao Zhang, Xiaolong Bu, Hongwu Yang, Lili Wang

doi:10.1109/access.2019.2954342


Open Access

https://doi.org/10.1109/access.2019.2954342


Journal: IEEE Access
Publication Date: Jan 1, 2019
Citations: 20
License type: CC BY 4.0

Affiliation: Northwest Normal University

Abstract

This paper proposes a deep learning-based Mandarin-Tibetan cross-lingual speech synthesis method that realizes both Mandarin and Tibetan speech synthesis within a single framework. Because a Tibetan training corpus is hard to record, we train the acoustic models with a large-scale Mandarin multi-speaker corpus and a small-scale Tibetan single-speaker corpus. The acoustic models are trained with a deep neural network (DNN), a hybrid long short-term memory (LSTM) network, and a hybrid bi-directional long short-term memory (BLSTM) network. We also extend our Chinese text analyzer with a Tibetan text analyzer so that context-dependent labels can be generated from input Chinese or Tibetan sentences. The Tibetan text analyzer includes text normalization, a novel Tibetan word segmentation method that combines a BLSTM with a conditional random field, prosodic boundary prediction, and grapheme-to-phoneme conversion. We select the initials and finals of both Mandarin and Tibetan as the speech synthesis units and train a speaker-independent mixed-language average voice model (AVM) with the DNN, hybrid LSTM, and hybrid BLSTM on the mixed Mandarin and Tibetan corpus. Speaker adaptation is then applied to the AVM with a small target-speaker corpus to obtain speaker-dependent DNN, hybrid LSTM, or hybrid BLSTM models of Mandarin or Tibetan. Finally, we synthesize Mandarin or Tibetan speech through the speaker-dependent Mandarin or Tibetan models. The experiments show that the hybrid BLSTM-based cross-lingual speech synthesis framework outperforms the other two cross-lingual frameworks and the Tibetan monolingual framework. Mixing in the Tibetan training corpus does not affect the voice quality of the synthesized Mandarin speech. Furthermore, the hybrid BLSTM-based cross-lingual framework needs only 60% of the training corpus to synthesize a voice similar to that of the Tibetan monolingual framework. Therefore, the proposed method can be used for speech synthesis of low-resource languages by borrowing the corpus of a high-resource language.
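The word segmentation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which tagging scheme the BLSTM-CRF model predicts, so the code below assumes the common B/M/E/S scheme (begin, middle, end, single) applied per syllable. The function name `tags_to_words` and the placeholder syllables are hypothetical, and the tagger itself is not implemented here; the sketch only shows how a predicted tag sequence could be turned into words.

```python
from typing import List


def tags_to_words(syllables: List[str], tags: List[str]) -> List[List[str]]:
    """Group syllables into words using B/M/E/S tags (assumed scheme).

    B = first syllable of a multi-syllable word, M = middle syllable,
    E = last syllable, S = single-syllable word.
    """
    if len(syllables) != len(tags):
        raise ValueError("syllables and tags must have the same length")

    words: List[List[str]] = []
    current: List[str] = []
    for syllable, tag in zip(syllables, tags):
        # A new word starts at B or S; flush any unfinished word first so
        # that an ill-formed tag sequence still yields a segmentation.
        if tag in ("B", "S") and current:
            words.append(current)
            current = []
        current.append(syllable)
        # A word is complete at E or S.
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:
        # Trailing syllables without a closing E tag form the last word.
        words.append(current)
    return words


# Placeholder syllables stand in for Tibetan syllables split at the tsheg.
segmented = tags_to_words(["s1", "s2", "s3", "s4"], ["B", "E", "S", "S"])
```

Here `segmented` groups the first two placeholder syllables into one word and leaves the last two as single-syllable words. A CRF output layer normally prevents ill-formed sequences such as `B B`, but the decoder above tolerates them rather than failing.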

Highlights

In recent years, cross-lingual speech synthesis has been a popular topic in text-to-speech synthesis (TTS) research [1], [2].
Tibetan text analysis: we further developed a Tibetan text analyzer, based on the Mandarin text analyzer we had already built, for Mandarin-Tibetan cross-lingual speech synthesis.
The acoustic model of the hybrid bi-directional long short-term memory (BLSTM)-based TTS framework is better than those of the other frameworks.

Summary

INTRODUCTION

Cross-lingual speech synthesis has been a popular topic in text-to-speech synthesis (TTS) research [1], [2]. DNN-based speaker adaptation [22], [23] has been adopted in various neural network-based TTS applications [20], [21]. However, the adaptation method has not been applied to Mandarin-Tibetan cross-lingual speech synthesis. To address these limitations, we adopt a deep learning-based framework to realize Mandarin-Tibetan cross-lingual speech synthesis. We use the initials and finals of Mandarin and Tibetan as the speech synthesis units to train a speaker-independent mixed-language average voice model (AVM) with DNN, hybrid LSTM, and hybrid BLSTM from a large Mandarin multi-speaker corpus and a small Tibetan single-speaker corpus. LSTM can effectively capture long-term temporal dependencies. It has been widely used in natural language processing and in acoustic modeling for speech synthesis. The WORLD [26] vocoder is used to generate the speech waveforms from the acoustic parameters.
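To make the idea of initials and finals as synthesis units concrete, the following is a minimal sketch for the Mandarin side only. It assumes tone-numbered pinyin input (for example `zhong1`) and treats `y` and `w` as initials, which is one common convention; the paper's actual unit inventory and its Tibetan counterpart are not given in this summary. The names `INITIALS` and `split_syllable` are hypothetical.

```python
from typing import Optional, Tuple

# Pinyin initials, two-letter ones first so that "zh" is matched before "z".
# Treating "y" and "w" as initials is an assumption of this sketch.
INITIALS = (
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
    "y", "w",
)


def split_syllable(syllable: str) -> Tuple[str, str, Optional[int]]:
    """Split a tone-numbered pinyin syllable into (initial, final, tone)."""
    tone: Optional[int] = None
    if syllable and syllable[-1].isdigit():
        tone = int(syllable[-1])
        syllable = syllable[:-1]
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):], tone
    # Zero-initial syllables such as "an" consist of a final only.
    return "", syllable, tone


units = [split_syllable(s) for s in ["zhong1", "guo2", "an4"]]
```

Each tuple in `units` holds the initial, the final, and the tone number. A real front end would also map these units to context-dependent labels, which this sketch does not attempt.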

TIBETAN TEXT ANALYSIS

PROSODIC BOUNDARY PREDICTION

TIBETAN GRAPHEME-TO-PHONEME CONVERSION

DISCUSSION

Findings

CONCLUSION AND FUTURE WORK


