Abstract

Emotional voice conversion (EVC) converts the emotional features of an utterance into those of a target emotion while retaining semantic information and speaker identity. Recently, researchers have leveraged deep learning methods to improve EVC performance, such as deep neural networks (DNNs), sequence-to-sequence (seq2seq) models, long short-term memory (LSTM) networks, and convolutional neural networks (CNNs), as well as their combinations with attention mechanisms. However, these methods often suffer from instability problems (e.g., mispronunciations and skipped phonemes) because the models fail to capture temporal intra-relationships across a wide range of frames, resulting in unnatural speech and discontinuous emotional expression. To strengthen the models' temporal dependency and thus their ability to capture intra-relationships among frames, we explored the power of the transformer in this study. Specifically, we proposed a CycleGAN-based model with a transformer (CycleTransGAN) and investigated its ability on the EVC task. During training, we adopted curriculum learning to gradually increase the frame length, so that the model learns first from short segments and eventually from the entire utterance. The proposed method was evaluated on a Japanese emotional speech dataset and the Emotional Speech Dataset (ESD, containing English and Chinese speech), and compared against widely used EVC baselines (ACVAE, CycleGAN) in both objective and subjective evaluations. The results indicate that the proposed model converts emotion with higher emotional similarity, quality, and naturalness.

Highlights

• A CycleTransGAN is proposed to improve performance on the emotional voice conversion (EVC) task; a minimal sketch of the cycle-consistency idea appears below.
• Curriculum learning is adopted to gradually increase the input length during training (see the schedule sketch after the list).
• A fine-grained-level discriminator is designed to enhance the model's ability to convert emotional voices.
• The proposed method is evaluated on a Japanese emotional speech dataset and the Emotional Speech Dataset (ESD, containing English and Chinese speech).
• The transformer widens the range of the model's temporal dependency, which improves the quality of the converted speech.
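To make the cycle-consistency idea concrete, the following is a minimal PyTorch sketch of a CycleGAN-style conversion step with transformer generators. All module names, shapes, and hyperparameters here are illustrative assumptions, not the paper's actual architecture; the full CycleTransGAN also includes adversarial and identity losses and the fine-grained discriminator, which are omitted.

```python
# Hypothetical sketch: cycle-consistency with transformer generators.
# Shapes and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn

class TransformerGenerator(nn.Module):
    """Maps a (batch, frames, n_mels) feature sequence between emotions."""
    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.proj_in = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.proj_out = nn.Linear(d_model, n_mels)

    def forward(self, x):
        return self.proj_out(self.encoder(self.proj_in(x)))

# Generators for both directions: neutral -> emotional and back.
G_ne, G_en = TransformerGenerator(), TransformerGenerator()
x_neutral = torch.randn(8, 128, 80)  # dummy batch of mel features

# Cycle-consistency: converting to the target emotion and back should
# reconstruct the input, preserving content and speaker identity.
fake_emotional = G_ne(x_neutral)
cycled_neutral = G_en(fake_emotional)
cycle_loss = nn.functional.l1_loss(cycled_neutral, x_neutral)
```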
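The curriculum over frame lengths can be sketched as below. This is a minimal illustration assuming a stepwise schedule; the function names, step sizes, and cap are hypothetical, not the paper's reported settings.

```python
# Hypothetical curriculum schedule: training segments start short and
# grow over epochs until they approximate whole utterances.
import numpy as np

def curriculum_frame_length(epoch, start=32, step=16, max_len=256, grow_every=20):
    """Return the number of acoustic frames per training segment at `epoch`.

    The length is increased every `grow_every` epochs until it reaches
    `max_len` frames (a proxy for covering the entire utterance).
    """
    return min(start + (epoch // grow_every) * step, max_len)

def sample_segment(features, frame_len, rng=np.random.default_rng()):
    """Randomly crop a (frames, dims) feature matrix to `frame_len` frames."""
    n_frames = features.shape[0]
    if n_frames <= frame_len:
        return features  # utterance already shorter than the curriculum length
    start = rng.integers(0, n_frames - frame_len)
    return features[start:start + frame_len]

# Example: the segment length grows from 32 frames toward the 256-frame cap.
for epoch in (0, 20, 40, 200):
    print(epoch, curriculum_frame_length(epoch))
```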
