Abstract

Foreign accent conversion (FAC) is the problem of generating a synthetic voice that has the voice identity of a second-language (L2) learner and the pronunciation patterns of a native (L1) speaker. This synthetic voice has been referred to as a “golden-speaker” in the pronunciation-training literature. FAC is generally achieved by building a voice-conversion model that maps utterances from a source (L1) speaker onto the target (L2) speaker. As such, FAC requires that a reference utterance from the L1 speaker be available at synthesis time. This greatly restricts the application scope of the FAC system. In this work, we propose a “reference-free” FAC system that eliminates the need for reference L1 utterances at synthesis time, and transforms L2 utterances directly. The system is trained in two steps. First, a conventional FAC procedure is used to create a golden-speaker using utterances from a reference L1 speaker (which are then discarded) and the L2 speaker. Second, a pronunciation-correction model is trained to convert L2 utterances to match the golden-speaker utterances obtained in the first step. At synthesis time, the pronunciation-correction model directly transforms a novel L2 utterance into its golden-speaker counterpart. Our results show that the system reduces foreign accents in novel L2 utterances, achieving a 20.5% relative reduction in word-error-rate of an American English automatic speech recognizer and a 19% reduction in perceptual ratings of foreign accentedness obtained through listening tests. Over 73% of the listeners also rated golden-speaker utterances as having the same voice identity as the original L2 utterances.
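The word-error-rate figure above is a relative reduction, not an absolute one. The minimal sketch below shows the arithmetic. The function name `relative_reduction` and the two input values are assumptions made for illustration; they are not taken from the paper, which reports only the relative figure in this abstract.

```python
def relative_reduction(baseline: float, converted: float) -> float:
    """Return the fractional drop of `converted` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - converted) / baseline


if __name__ == "__main__":
    # Hypothetical WER values chosen only to illustrate the formula;
    # they are NOT results reported by the authors.
    baseline_wer = 0.40    # assumed WER on the original L2 utterances
    converted_wer = 0.318  # assumed WER on the converted utterances
    print(f"{relative_reduction(baseline_wer, converted_wer):.1%}")
```

With these assumed inputs, an absolute drop of 8.2 percentage points corresponds to a 20.5% relative reduction, which is why the two kinds of figure should not be compared directly.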

