Abstract

People and things can be connected through the Internet of Things (IoT), and speech synthesis is one of its key technologies. End-to-end speech synthesis systems can already produce fairly realistic human voices, but the commonly used parallel text-to-speech pipelines lose useful information when passing representations between their two stages, and the control over the synthesized speech remains monotonous, with insufficient expression of features such as emotion; this makes emotional speech synthesis a challenging task. In this paper, we propose a new system, Emo-VITS, built on the highly expressive speech synthesis model VITS, to control the emotion of text-to-speech synthesis. We design an emotion network that extracts global and local features from the reference audio, and then fuse these features in an attention-based emotion feature fusion module to achieve more accurate and comprehensive emotional speech synthesis. Experimental results show that the error rate of the Emo-VITS system rises only slightly compared with the network without emotion modeling and does not impair semantic understanding, while the system outperforms other networks in naturalness, sound quality, and emotional similarity.
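To make the fusion step concrete, below is a minimal sketch of attention-based fusion of a global emotion embedding with frame-level (local) features of the reference audio. The class and variable names (EmotionFusion, global_emb, local_feats) and all dimensions are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of attention-based emotion feature fusion (not the paper's code).
import torch
import torch.nn as nn


class EmotionFusion(nn.Module):
    """Fuse a global emotion embedding with frame-level (local) features
    of the reference audio via multi-head attention."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        # The global embedding acts as the query; local features provide keys/values.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, global_emb: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # global_emb: (batch, dim); local_feats: (batch, frames, dim)
        query = global_emb.unsqueeze(1)                      # (batch, 1, dim)
        attended, _ = self.attn(query, local_feats, local_feats)
        attended = attended.squeeze(1)                       # (batch, dim)
        # Concatenate global and attended local information, then project to a
        # single emotion embedding used to condition the synthesizer.
        return self.proj(torch.cat([global_emb, attended], dim=-1))


if __name__ == "__main__":
    fusion = EmotionFusion(dim=256)
    g = torch.randn(2, 256)          # one global emotion embedding per utterance
    l = torch.randn(2, 120, 256)     # frame-level features of the reference audio
    print(fusion(g, l).shape)        # torch.Size([2, 256])
```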
