Abstract

This paper proposes an end-to-end multi-speaker speech synthesis method with controllable stress, which makes synthetic speech more prominent and expressive and thereby improves its naturalness. First, we record a small parallel corpus of stressed and neutral audio and label it with three types of stress: raised pitch, lengthened duration, and both. Second, building on a multi-speaker acoustic model, we model speaker identity and stress as separate features so that stress can be transferred between speakers. Finally, we use LPCNet to convert the target speaker's stress-controlled spectrum into audio. We evaluate the method with a confusion matrix and the Mean Opinion Score (MOS). In addition, we train the base model on 100 speakers, so that stressed audio can be synthesized for any target speaker from only half an hour of neutral speech, which greatly improves the efficiency of speech synthesis. Experimental results indicate that the proposed method does not reduce the quality or correctness of the synthetic speech, while improving naturalness, expressiveness, and similarity by at least 5% according to the MOS results.
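The two evaluation criteria named above can be illustrated with a minimal sketch. It assumes that the confusion matrix compares the intended stress type with the type listeners perceive, and that the reported improvement is a relative change in mean MOS; the abstract confirms neither. The label names, function names, and all scores below are hypothetical and are not data from the paper.

```python
from collections import Counter
from statistics import mean

# Hypothetical label set for the three stress types named in the abstract.
LABELS = ["pitch", "duration", "both"]


def confusion_matrix(intended, perceived, labels=LABELS):
    """Rows: intended stress type; columns: type the listener perceived."""
    counts = Counter(zip(intended, perceived))
    return [[counts[(t, p)] for p in labels] for t in labels]


def relative_mos_gain(baseline_scores, proposed_scores):
    """Relative change in mean opinion score, as a fraction of the baseline."""
    base = mean(baseline_scores)
    return (mean(proposed_scores) - base) / base


# Made-up listener responses, for illustration only.
intended = ["pitch", "pitch", "duration", "both", "both", "duration"]
perceived = ["pitch", "both", "duration", "both", "both", "duration"]
matrix = confusion_matrix(intended, perceived)

# Made-up 1-5 ratings for a baseline system and a proposed system.
baseline = [4, 4, 4, 4]
proposed = [4, 4, 5, 4]
gain = relative_mos_gain(baseline, proposed)

for label, row in zip(LABELS, matrix):
    print(f"{label:>8}: {row}")
print(f"relative MOS gain: {gain:.2%}")
```

Diagonal entries of the matrix count the utterances whose stress type was perceived as intended, so off-diagonal mass shows which types listeners confuse with each other.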
