Emphatic Speech Generation with Conditioned Input Layer and Bidirectional LSTMs for Expressive Speech Synthesis

Runnan Li, Lianhong Cai, Zhiyong Wu, Helen Meng, Yuchen Huang, Jia Jia

doi:10.1109/icassp.2018.8461748

Abstract

By highlighting the focus of an utterance to draw attention, emphasis plays an important role in expressing and understanding speaker intention in spoken interaction. Emphatic speech synthesis has therefore drawn increasing interest in the text-to-speech (TTS) field. Three problems remain for emphatic speech synthesis: 1) sparseness of emphatic speech data; 2) limited flexibility of the trained model; 3) insufficient modelling of secondary emphasis. Recently, statistical parametric speech synthesis (SPSS) systems based on recurrent neural networks (RNNs) and their bidirectional long short-term memory (BLSTM) variants have shown adaptability and controllability in acoustic modelling, and can thus solve the aforementioned problems. In this paper, we propose a novel conditional input layer for the conventional BLSTM-RNN based approach, which combines emphasis-specific vectors with linguistic features as input to produce emphatic speech trajectories. Objective and subjective evaluations demonstrate that the proposed approach can produce emphatic speech trajectories with high quality and naturalness while requiring only an additional small-scale emphatic speech corpus.

Full Text