Abstract

Emotional speech generation is a challenging and widely studied topic in speech processing. Because the design of effective speech feature representations and generation models directly affects the accuracy of emotional speech generation, a general solution for emotional speech synthesis is difficult to find. In this paper, we take the CycleGAN model as a starting point and use an improved convolutional neural network (CNN) model together with an identity mapping loss scheme to capture temporal information effectively. At the same time, we learn both the forward mapping and the inverse mapping to find the best-matching design, preserving the linguistic content of the speech in the process without relying on additional audio data. Experiments on a corpus of children's read speech show that the generated emotional speech can be accurately recognized, based on a comparison of speech emotion before and after the improvement. Comparisons with common emotional speech generation models verify the advantages of the proposed model.
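To illustrate the kind of objective the abstract describes, the following is a minimal sketch of a CycleGAN-style training loss combining adversarial, cycle-consistency, and identity-mapping terms over two mapping directions. It assumes a PyTorch setup; the names (G_xy, G_yx, D_x, D_y, lambda_cyc, lambda_id) and the tiny 1-D CNN blocks are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()  # least-squares adversarial loss
l1 = nn.L1Loss()    # cycle-consistency and identity losses

def generator_loss(G_xy, G_yx, D_x, D_y, x, y, lambda_cyc=10.0, lambda_id=5.0):
    """Combined generator objective for one step (sketch, hypothetical names)."""
    fake_y = G_xy(x)  # neutral -> emotional speech features
    fake_x = G_yx(y)  # emotional -> neutral speech features

    # Adversarial terms: each generator tries to fool the opposite discriminator.
    adv = mse(D_y(fake_y), torch.ones_like(D_y(fake_y))) + \
          mse(D_x(fake_x), torch.ones_like(D_x(fake_x)))

    # Cycle consistency: mapping forward and then back should recover the input.
    cyc = l1(G_yx(fake_y), x) + l1(G_xy(fake_x), y)

    # Identity mapping: a target-domain input passed through its own generator
    # should change as little as possible, helping preserve linguistic content.
    idt = l1(G_xy(y), y) + l1(G_yx(x), x)

    return adv + lambda_cyc * cyc + lambda_id * idt

if __name__ == "__main__":
    # Toy 1-D CNN blocks over (batch, feature_dim, frames) tensors, e.g. 24 MCEPs.
    def tiny_cnn(out_ch):
        return nn.Sequential(nn.Conv1d(24, 64, 5, padding=2), nn.ReLU(),
                             nn.Conv1d(64, out_ch, 5, padding=2))
    G_xy, G_yx = tiny_cnn(24), tiny_cnn(24)
    D_x, D_y = tiny_cnn(1), tiny_cnn(1)
    x, y = torch.randn(2, 24, 128), torch.randn(2, 24, 128)
    print(generator_loss(G_xy, G_yx, D_x, D_y, x, y).item())
```

In a full training loop, the two discriminators would also be updated with their own least-squares losses on real versus generated features; the sketch above only shows the generator-side objective that the identity mapping term belongs to.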
