Acoustic model-based subword tokenization and prosodic-context extraction without language knowledge for text-to-speech synthesis

Masashi Aso,Shinnosuke Takamichi,Norihiro Takamune,Hiroshi Saruwatari

doi:10.1016/j.specom.2020.09.003

Open Access

https://doi.org/10.1016/j.specom.2020.09.003

Journal: Speech Communication
Publication Date: Sep 24, 2020
Citations: 11
License type: cc-by

Affiliation: The University of Tokyo

Abstract

This paper presents text tokenization and context extraction methods for text-to-speech (TTS) synthesis that require no language knowledge. To generate prosody, statistical parametric TTS synthesis typically requires expert knowledge of the target language, so the languages it can serve are limited to resource-rich ones. To achieve TTS synthesis without language knowledge, we propose acoustic model-based subword tokenization and unsupervised extraction of prosodic contexts. The subword tokenization determines language units suitable for prosody generation, and the context extraction retrieves contexts from pairs of subwords and prosody. The proposed methods function without language knowledge and can improve F0 prediction accuracy. Experimental evaluation demonstrates that 1) training of the proposed subword tokenization, which uses the expectation-maximization algorithm and deep neural networks, is empirically stable, 2) the proposed subword tokenization splits text into subwords that are close to language-specific units, and 3) the proposed methods outperform conventional methods based on language model-based tokenization in terms of synthetic speech quality.
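
The abstract does not spell out how a segmentation is chosen, so the following is only an illustrative sketch, not the authors' method. A common step in EM-trained subword tokenizers is a Viterbi search that picks the highest-scoring split of a string given one score per candidate subword. Here the function name `best_segmentation`, the `max_len` limit, and the hand-written `TOY_SCORES` table are all assumptions made for illustration; in the paper's setting the scores would presumably be derived from an acoustic model rather than written by hand.

```python
import math


def best_segmentation(text, log_scores, max_len=4):
    """Return the highest-scoring split of `text` into known subwords.

    `log_scores` maps candidate subwords to log-scores. The values used
    below are hypothetical; they stand in for scores a trained model
    would supply.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best total score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last subword
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in log_scores:
                continue
            score = best[start] + log_scores[piece]
            if score > best[end]:
                best[end] = score
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with the given subwords")

    # Walk the back-pointers from the end of the string to recover the split.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Hypothetical score table (assumption, not taken from the paper).
TOY_SCORES = {"a": -3.0, "b": -3.0, "c": -3.0, "ab": -2.0, "bc": -2.5, "abc": -6.0}

print(best_segmentation("abcab", TOY_SCORES))  # → ['ab', 'c', 'ab']
```

In an EM-style training loop, a search like this (or its soft, forward-backward counterpart) would alternate with re-estimating the scores; that re-estimation step is not shown here.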
