Abstract

The goal of emotional voice conversion (EVC) is to convert the emotion of a speaker’s voice from one state to another while maintaining the original speaker’s identity and the linguistic substance of the message. Research on EVC in the Arabic language is well behind that conducted on languages with a wider distribution, such as English. The primary objective of this study is to determine whether Arabic emotions may be converted using a model trained for another language. In this work, we used an unsupervised many-to-many non-parallel generative adversarial network (GAN) voice conversion (VC) model called StarGANv2-VC to perform an Arabic EVC (A-EVC). The latter is realized by using pre-trained phoneme-level automatic speech recognition (ASR) and fundamental frequency (F0) models in the English language. The generated voice is evaluated by prosody and spectrum conversion in addition to automatic emotion recognition and speaker identification using a convolutional recurrent neural network (CRNN). The results of the evaluation indicated that male voices were scored higher than female voices and that the evaluation score for the conversion from neutral to other emotions was higher than the evaluation scores for the conversion of other emotions.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call