Abstract

By highlighting the focus of an utterance, emphasis plays an important role in expressing and understanding speaker intention in spoken interaction. Emphatic speech synthesis has therefore drawn increasing interest in the text-to-speech (TTS) field. Three problems remain for emphatic speech synthesis: 1) the sparseness of emphatic speech data; 2) the flexibility of the trained model; and 3) inadequate modelling of secondary emphasis. Recently, statistical parametric speech synthesis (SPSS) systems based on recurrent neural networks (RNNs) and their bidirectional long short-term memory (BLSTM) variants have shown adaptability and controllability in acoustic modelling, and can thus address the aforementioned problems. In this paper, we propose a novel conditional input layer for the conventional BLSTM-RNN based approach, which combines emphasis-specific vectors with linguistic features as input to produce emphatic speech trajectories. Results from objective and subjective evaluations demonstrate that the proposed approach can produce emphatic speech trajectories with high quality and naturalness while requiring only an additional small-scale emphatic speech corpus.
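
The idea of combining emphasis-specific vectors with linguistic features at the input can be illustrated with a minimal sketch. The abstract does not specify the encoding, the label set, or the feature dimensions, so everything here is an assumption made for illustration: the three emphasis levels, the one-hot code, and the helper names `emphasis_vector` and `conditional_input` are hypothetical and are not taken from the paper. The sketch only shows per-frame concatenation, not the BLSTM-RNN itself.

```python
from typing import List

# Hypothetical label set: the abstract mentions secondary emphasis,
# but does not state the exact emphasis classes used.
EMPHASIS_LEVELS = ["none", "secondary", "primary"]


def emphasis_vector(level: str) -> List[float]:
    """Return a one-hot code for an emphasis level (an assumed encoding)."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unknown emphasis level: {level!r}")
    return [1.0 if level == name else 0.0 for name in EMPHASIS_LEVELS]


def conditional_input(
    linguistic: List[List[float]], levels: List[str]
) -> List[List[float]]:
    """Append the emphasis code to each frame's linguistic feature vector."""
    if len(linguistic) != len(levels):
        raise ValueError("need exactly one emphasis level per frame")
    return [feats + emphasis_vector(lv) for feats, lv in zip(linguistic, levels)]


# Two frames with two (made-up) linguistic features each.
frames = [[0.1, 0.2], [0.3, 0.4]]
network_input = conditional_input(frames, ["none", "primary"])
```

Under these assumptions, each frame passed to the acoustic model carries its linguistic features followed by the emphasis code, so the same network could be driven toward neutral or emphatic output by changing only the appended vector.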
