An Approach to Cross-Lingual Voice Conversion

Sai Sirisha Rallabandi,Suryakanth V Gangashetty

doi:10.1109/ijcnn.2019.8852225

Abstract

The most prevalent multilingual Text-to-Speech (TTS) synthesis systems encounter an unnatural speaker shift at the language boundaries. This is observed when they are employed for code-mixed TTS synthesis. For the very fact that the collection of polyglot speech is non-trivial, many alternative approaches have been in focus. Cross-Lingual Voice Conversion (CLVC) has been one of those to generate speech with desired speaker and language identities. Our aim in this paper is to design a light-weighted CLVC framework between a pair of Mandarin-English speakers. CLVC is challenging when compared to traditional Voice Conversion (VC) because of its nature of accommodating unaligned corpus from the source and target speakers. We thus focus on generating a parallel corpus for CLVC and bridging the gap between speakers and languages. We perform a text-independent voice conversion with a three-layered conventional Neural Network (NN) for this purpose. The main contributions include i) Source similarity in both training and conversion stages of CLVC, ii) generation of a parallel corpus and iii) text independent and transcription free CLVC. We exploit two variants of a Neural Network in the proposed framework, i) an autoencoder to enable the source similarity and generation of parallel corpus, ii) a traditional DNN for feature mapping between the source and target. The subjective and objective evaluations show that the proposed method is indeed capable of performing a CLVC with an auto-encoded speech.

Full Text