A Review of Deep Learning Based Speech Synthesis

Yishuang Ning, Liang-Jie Zhang, Chunxiao Xing, Zhiyong Wu, Sheng He

doi:10.3390/app9194050

Abstract

Speech synthesis, also known as text-to-speech (TTS), has attracted increasing attention. Recent advances in speech synthesis are overwhelmingly driven by deep learning, including end-to-end techniques, which have been used to enhance a wide range of application scenarios such as intelligent speech interaction, chatbots and conversational artificial intelligence (AI). For speech synthesis, deep learning based techniques can leverage large-scale <text, speech> pairs to learn effective feature representations that bridge the gap between text and speech, thus better characterizing the properties of events. To better understand the research dynamics in the speech synthesis field, this paper first introduces the traditional speech synthesis methods and highlights the importance of acoustic modeling within the composition of the statistical parametric speech synthesis (SPSS) system. It then gives an overview of the advances in deep learning based speech synthesis, including the end-to-end approaches that have achieved state-of-the-art performance in recent years. Finally, it discusses the problems of deep learning methods for speech synthesis and points out some appealing research directions that can bring speech synthesis research to a new frontier.

Highlights

Speech synthesis, known as text-to-speech (TTS), is a comprehensive technology that involves many disciplines such as acoustics, linguistics, digital signal processing and statistics. The main task is to convert text input into speech output.
Deep learning (DL)-based speech synthesis models use complete context information and distributed representations to replace the clustering process of the context decision tree in hidden Markov models (HMMs), and use multiple hidden layers to map context features to high-dimensional acoustic features, which makes the quality of the synthesized speech better than that of the traditional methods.
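
As a rough illustration of the last highlight, the sketch below shows a feed-forward network with several hidden layers that maps per-frame context features to acoustic features. It is a minimal sketch under stated assumptions, not the architecture of any system surveyed in the paper: the function names `init_mlp` and `predict_acoustic`, the layer sizes (300 context dimensions, three hidden layers of 256 units, 60 acoustic dimensions), the tanh activation and the random untrained weights are all hypothetical choices made for illustration.

```python
import numpy as np


def init_mlp(layer_sizes, seed=0):
    """Create random, untrained weights for a feed-forward network.

    layer_sizes is a list such as [n_context, n_hidden, ..., n_acoustic].
    All sizes used in this example are hypothetical.
    """
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Scale the initial weights by 1/sqrt(n_in) to keep activations bounded.
        W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
        b = np.zeros(n_out)
        params.append((W, b))
    return params


def predict_acoustic(params, context):
    """Map context features of shape (frames, n_context) to acoustic features.

    Every layer except the last applies a tanh nonlinearity; the last layer
    is linear, as is usual for a regression output.
    """
    h = context
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b


# Hypothetical dimensions: 300 context features per frame, 60 acoustic features.
params = init_mlp([300, 256, 256, 256, 60])
frames = np.random.default_rng(1).normal(size=(5, 300))
acoustic = predict_acoustic(params, frames)
print(acoustic.shape)
```

In a real system the weights would be trained on aligned <text, speech> pairs, and the predicted acoustic features would be passed to a vocoder to generate the waveform; both steps are outside the scope of this sketch.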

Summary

Introduction

Speech synthesis, also known as text-to-speech (TTS), is a comprehensive technology that involves many disciplines such as acoustics, linguistics, digital signal processing and statistics. The main task is to convert text input into speech output. With the development of speech synthesis technologies, from the earlier formant-based parametric synthesis [1,2] and waveform concatenation based methods [3,4,5] to the current statistical parametric speech synthesis (SPSS) [6], the intelligibility and naturalness of the synthesized speech have been improved greatly. The main reason is that the existing methods are based on shallow models that contain only one layer of nonlinear transformation units, such as hidden Markov models (HMMs) [7,8] and maximum entropy (MaxEnt) [9].

Journal: Applied Sciences	Publication Date: Sep 27, 2019
Citations: 93	License type: CC BY 4.0
