Learning to Model Prosodic and Spectral Features for Non-parallel Emotive Speech Conversion

Sri Harsha Dumpala, Sageev Oore

doi:10.21428/594757db.930ce165

Abstract

Emotion conversion in speech has attracted recent attention owing to its importance in human-machine interaction and the high quality of current speech synthesis. Most existing approaches rely on parallel data, which is not available in many real-time applications. We propose a non-parallel emotion conversion approach based on the cycle generative adversarial network (cycleGAN) framework. We introduce new variants of cycleGAN that use recurrent neural networks and multi-kernel convolutional neural networks to model prosodic features along with spectral features for emotion conversion in speech. Subjective evaluation results show the effectiveness of our approach in converting both natural speech and unseen synthesized speech samples to different target emotive states.

Full Text