Abstract

Non-autoregressive text-to-speech models such as FastSpeech 2 can rapidly synthesize high-quality speech and allow explicit control of the speech signal's pitch, energy, and speed. However, controlling emotion while maintaining natural, human-like speech remains a problem. In this work, we propose an expressive speech synthesis model that can synthesize high-quality speech with a desired emotion. The proposed model includes two main components: (1) a Mel Emotion Encoder, which extracts an emotion embedding from the Mel-spectrogram of the audio, and (2) FastSpeechStyle, a non-autoregressive model modified from vanilla FastSpeech 2. FastSpeechStyle replaces the vanilla FFT block [Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao and T.-Y. Liu, FastSpeech: Fast, robust and controllable text to speech, Adv. Neural Inf. Process. Syst. 32 (2019)] with an improved Conformer block, in which the standard LayerNorm is replaced by Style-Adaptive LayerNorm to "shift" and "scale" hidden features according to the emotion embedding, so as to better model local and global dependencies in the acoustic model. We also propose a specific inference strategy to control the desired emotion of the synthesized speech. The experimental results show that (1) the proposed model with the improved Conformer achieves higher scores than the baseline model in all naturalness and emotion similarity scores, and (2) the proposed model maintains fast inference speed, like the baseline model.
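To make the Style-Adaptive LayerNorm idea concrete, the following is a minimal PyTorch sketch of normalizing hidden features and then scaling and shifting them with a gain and bias predicted from an emotion embedding. It is illustrative only, not the authors' implementation; the dimensions, module names, and the single linear projection are assumptions.

```python
import torch
import torch.nn as nn


class StyleAdaptiveLayerNorm(nn.Module):
    """Normalize hidden features, then scale ("gamma") and shift ("beta") them
    using parameters predicted from a style (emotion) embedding."""

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        # LayerNorm without learned affine parameters; the affine transform
        # is instead predicted from the emotion embedding.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Project the emotion embedding to per-channel scale and shift.
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x:     (batch, time, hidden_dim) hidden features inside the Conformer block
        # style: (batch, style_dim)        emotion embedding from the Mel Emotion Encoder
        gamma, beta = self.affine(style).chunk(2, dim=-1)      # (batch, hidden_dim) each
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)


if __name__ == "__main__":
    # Usage sketch with assumed sizes: 256-dim hidden features, 128-dim emotion embedding.
    saln = StyleAdaptiveLayerNorm(hidden_dim=256, style_dim=128)
    hidden = torch.randn(2, 50, 256)   # (batch, frames, hidden_dim)
    emotion = torch.randn(2, 128)      # one emotion embedding per utterance
    print(saln(hidden, emotion).shape) # torch.Size([2, 50, 256])
```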
