Abstract
Non-autoregressive text-to-speech models such as FastSpeech2 can synthesize high-quality speech quickly and allow explicit control over the pitch, energy, and speed of the speech signal. However, controlling emotion while maintaining natural, human-like speech remains an open problem. In this work, we propose an expressive speech synthesis model that can synthesize high-quality speech with a desired emotion. The proposed model consists of two main components: (1) the Mel Emotion Encoder, which extracts an emotion embedding from the Mel-spectrogram of the audio, and (2) FastSpeechStyle, a non-autoregressive model modified from vanilla FastSpeech2. FastSpeechStyle uses an Improved Conformer block, which replaces the standard LayerNorm with Style-Adaptive LayerNorm to "shift" and "scale" hidden features according to the emotion embedding, instead of the vanilla FFTBlock [Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao and T.-Y. Liu, FastSpeech: Fast, robust and controllable text to speech, Adv. Neural Inf. Process. Syst. 32 (2019)], to better model local and global dependencies in the acoustic model. We also propose a specific inference strategy to control the desired emotion of the synthesized speech. The experimental results show that (1) the proposed model with the Improved Conformer achieves higher naturalness and emotion similarity scores than the baseline model, and (2) the proposed model maintains the fast inference speed of the baseline model.
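To illustrate the core idea of Style-Adaptive LayerNorm described above, the following is a minimal sketch (not the authors' code): the layer normalization has no fixed affine parameters, and instead a per-channel scale and shift are predicted from the emotion embedding and applied to the normalized hidden features. All dimensions and module names here are hypothetical assumptions for illustration.

```python
import torch
import torch.nn as nn


class StyleAdaptiveLayerNorm(nn.Module):
    """Layer normalization whose scale and shift are conditioned on a style
    (emotion) embedding, in the spirit of the Improved Conformer block."""

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        # LayerNorm without learnable affine parameters; gain and bias
        # are produced from the style embedding instead.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Predict per-channel scale (gamma) and shift (beta) from the emotion embedding.
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim), style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        # Broadcast over the time axis and modulate the normalized features.
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)


if __name__ == "__main__":
    # Hypothetical sizes: 256-dim hidden features, 128-dim emotion embedding.
    saln = StyleAdaptiveLayerNorm(hidden_dim=256, style_dim=128)
    hidden = torch.randn(2, 100, 256)   # phoneme-level hidden features
    emotion = torch.randn(2, 128)       # emotion embedding from a Mel encoder
    out = saln(hidden, emotion)
    print(out.shape)                    # torch.Size([2, 100, 256])
```

In this sketch the same emotion embedding modulates every time step, which is how a style vector can "shift" and "scale" the acoustic model's hidden features without changing the non-autoregressive structure of the network.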