Abstract

Emotional speech generation is a challenging and widely applicable research topic in speech processing. Because the design of effective speech feature representations and generation models directly affects the accuracy of emotional speech generation, a general solution for emotional speech synthesis is difficult to find. This paper takes the CycleGAN model as its starting point and uses an improved convolutional neural network (CNN) together with an identity mapping loss to capture temporal information effectively. The model jointly learns the forward and reverse mappings to find the best-matching design, preserving the speech content in the process without relying on additional audio data. Experiments on a corpus of children's read speech show that the generated emotional speech is recognized accurately when the emotion is compared before and after the improvement. Comparison with common emotional speech generation models verifies the advantages of the proposed model.
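To make the objective described above concrete, the following is a minimal sketch of a CycleGAN-style generator loss with cycle consistency and an identity mapping term, of the kind used for non-parallel emotion conversion. All module names, feature shapes, and loss weights are illustrative assumptions for one conversion direction, not the paper's exact architecture; a full training loop would also update the discriminators and the reverse-direction generator.

```python
# Sketch (PyTorch) of a CycleGAN-style objective with an identity-mapping loss.
# Shapes, channel counts, and weights are assumptions for illustration only.
import torch
import torch.nn as nn

class Generator1D(nn.Module):
    """Toy 1D-CNN generator over mel-spectrogram-like features (channels x frames)."""
    def __init__(self, channels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 128, kernel_size=5, padding=2),
            nn.GLU(dim=1),                                  # gated activation: 128 -> 64 channels
            nn.Conv1d(64, channels, kernel_size=5, padding=2),
        )

    def forward(self, x):
        return self.net(x)

class Discriminator1D(nn.Module):
    """Toy discriminator producing a real/fake score map."""
    def __init__(self, channels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 64, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, x):
        return self.net(x)

def generator_losses(G, F, D_Y, x, y, lam_cyc=10.0, lam_id=5.0):
    """Generator-side losses for one direction: x (neutral) -> y (emotional)."""
    l1 = nn.L1Loss()
    fake_y = G(x)
    adv = ((D_Y(fake_y) - 1.0) ** 2).mean()   # least-squares adversarial loss
    cyc = l1(F(fake_y), x)                    # cycle consistency: F(G(x)) should recover x
    idt = l1(G(y), y)                         # identity mapping: G should leave target-domain input unchanged
    return adv + lam_cyc * cyc + lam_id * idt

# Usage on random mel-like batches of shape (batch, channels, frames).
x = torch.randn(4, 80, 128)   # stand-in "neutral" features
y = torch.randn(4, 80, 128)   # stand-in "emotional" features
G, F, D_Y = Generator1D(), Generator1D(), Discriminator1D()
loss = generator_losses(G, F, D_Y, x, y)
loss.backward()
```

The identity term is what the abstract refers to as retaining the speech information: it penalizes the generator for altering inputs that are already in the target emotion domain, which in practice helps preserve linguistic content while only the emotional characteristics are converted.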
