Abstract

Code-switching (CS) speech, in which speakers alternate between two or more languages within the same utterance, is common in multilingual communities. This phenomenon poses challenges for spoken language technologies such as automatic speech recognition (ASR) and text-to-speech synthesis (TTS), since these systems must handle input in a multilingual setting. Code-switching text and code-switching speech can be found in social media, but parallel speech and transcriptions of code-switching data, which are needed to train ASR and TTS, are generally unavailable. In this paper, we utilize a deep-learning-based speech chain framework to enable ASR and TTS to learn code-switching in a semi-supervised fashion. We base our system on Japanese-English conversational speech. We first train the ASR and TTS systems separately on parallel speech-text pairs of monolingual data (supervised learning) and then run the speech chain with only code-switching text or only code-switching speech (unsupervised learning). Experimental results reveal that this closed-loop architecture allows ASR and TTS to learn from each other and to improve performance even without any parallel code-switching data.
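
The abstract does not include code, but the closed-loop procedure it describes can be illustrated with a minimal sketch. The following is a hypothetical PyTorch example, assuming toy linear models and fixed-size tensors in place of the actual sequence-to-sequence ASR and TTS networks; the function names (`supervised_step`, `chain_step_from_text`, `chain_step_from_speech`) and dimensions are illustrative only. It shows the two unsupervised modes in the spirit of the speech chain framework: given unpaired text, TTS synthesizes speech and only ASR is updated; given unpaired speech, ASR transcribes and only TTS is updated.

```python
# Minimal sketch of the speech-chain closed loop (assumption: toy linear
# models and fixed-size tensors stand in for real attention-based
# sequence-to-sequence ASR/TTS networks).
import torch
import torch.nn as nn

FEAT_DIM, VOCAB = 80, 50           # speech feature dim, text vocabulary size

asr = nn.Linear(FEAT_DIM, VOCAB)   # speech features -> token logits
tts = nn.Linear(VOCAB, FEAT_DIM)   # one-hot text    -> speech features
opt_asr = torch.optim.Adam(asr.parameters(), lr=1e-3)
opt_tts = torch.optim.Adam(tts.parameters(), lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

def supervised_step(speech, text):
    """Paired monolingual data: train ASR and TTS independently."""
    opt_asr.zero_grad()
    ce(asr(speech), text).backward()
    opt_asr.step()

    opt_tts.zero_grad()
    one_hot = nn.functional.one_hot(text, VOCAB).float()
    mse(tts(one_hot), speech).backward()
    opt_tts.step()

def chain_step_from_text(text):
    """Unpaired code-switching text: TTS synthesizes speech, ASR must
    recover the original text; only ASR is updated."""
    one_hot = nn.functional.one_hot(text, VOCAB).float()
    with torch.no_grad():          # TTS acts as a fixed data generator
        synth_speech = tts(one_hot)
    opt_asr.zero_grad()
    ce(asr(synth_speech), text).backward()
    opt_asr.step()

def chain_step_from_speech(speech):
    """Unpaired code-switching speech: ASR transcribes, TTS must
    reconstruct the original speech; only TTS is updated."""
    with torch.no_grad():          # ASR acts as a fixed transcriber
        pseudo_text = asr(speech).argmax(dim=-1)
    one_hot = nn.functional.one_hot(pseudo_text, VOCAB).float()
    opt_tts.zero_grad()
    mse(tts(one_hot), speech).backward()
    opt_tts.step()

# Toy usage: random tensors stand in for real utterances and transcripts.
speech = torch.randn(8, FEAT_DIM)                    # batch of "utterances"
text = torch.randint(0, VOCAB, (8,))                 # batch of "transcripts"
supervised_step(speech, text)                        # monolingual paired data
chain_step_from_text(torch.randint(0, VOCAB, (8,)))  # CS text only
chain_step_from_speech(torch.randn(8, FEAT_DIM))     # CS speech only
```

Freezing the generating model in each unsupervised direction keeps gradients from flowing through the synthesized intermediate, so each model is trained only on the reconstruction of the data it was given; this is a simplification for illustration and not necessarily the exact update scheme used in the paper.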