Code-switching is the phenomenon whereby multilingual speakers spontaneously alternate between languages during discourse, and it is widespread in multilingual societies. Current state-of-the-art automatic speech recognition (ASR) systems are optimised for monolingual speech, but performance degrades severely when they are presented with multiple languages. We address ASR of speech containing switches between English and four South African Bantu languages. No comparable study on code-switched speech for these languages has been conducted before, and consequently no directly applicable benchmarks exist. Our study uses a new and unique corpus containing 14.3 hours of spontaneous speech extracted from South African soap operas. The varied nature of the code-switching in this data presents many challenges to ASR. We focus specifically on how the language model can be improved to better model bilingual language switches for English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho. Code-switching examples in the corpus transcriptions were extremely sparse, with the majority of code-switched bigrams occurring only once. Furthermore, differences in language typology between English and the Bantu languages, and among the Bantu languages themselves, pose additional challenges. We propose a new method that uses word embeddings, trained on text data that is both out-of-domain and monolingual, to synthesise artificial bilingual code-switched bigrams with which to augment the sparse language modelling training data. This technique has the particular advantage of not requiring any additional training data that includes code-switching. We show that the proposed approach is able to synthesise valid code-switched bigrams not seen in the training set. We also show that, by augmenting the training set with these bigrams, we achieve notable reductions in overall perplexity for all language pairs, and particularly substantial reductions (between 5% and 31%) in the perplexity calculated across a language switch boundary. We demonstrate that the proposed approach reduces the number of unseen code-switched bigram types in the test sets by up to 20.5%. Finally, we show that the augmented language models achieve reductions in the word error rate for three of the four language pairs considered. The gains were larger for language pairs with disjunctive orthography than for those with conjunctive orthography. We conclude that augmenting language model training data with code-switched bigrams synthesised using word embeddings trained on out-of-domain monolingual text is a viable means of improving the performance of ASR for code-switched speech.
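To make the data-augmentation idea concrete, the sketch below shows one plausible way bilingual code-switched bigrams could be synthesised from monolingual word embeddings: neighbours in each language's embedding space are substituted into the few switch bigrams observed in the transcriptions. The toy corpora, word lists, gensim settings and the substitution heuristic are illustrative assumptions only, not the exact procedure used in the study.

```python
# Illustrative sketch: augmenting sparse code-switched bigram counts by
# substituting embedding-space neighbours into observed switch bigrams.
# Assumes gensim is installed; all data below are hypothetical placeholders.

from gensim.models import Word2Vec

# Monolingual, out-of-domain text for each language (placeholder sentences).
english_sentences = [
    ["the", "teacher", "arrived", "late"],
    ["the", "doctor", "arrived", "early"],
    ["a", "nurse", "and", "a", "teacher", "spoke"],
]
isizulu_sentences = [
    ["uthisha", "ufikile", "namhlanje"],
    ["udokotela", "ufikile", "izolo"],
]

# Train a small embedding model per language on monolingual text only.
en_model = Word2Vec(english_sentences, vector_size=50, min_count=1, epochs=50)
zu_model = Word2Vec(isizulu_sentences, vector_size=50, min_count=1, epochs=50)

# Code-switched bigrams observed (often only once) in the in-domain
# transcriptions, stored as (isiZulu word, English word) switch pairs.
seen_switch_bigrams = [("ufikile", "teacher"), ("uthisha", "arrived")]


def synthesise_bigrams(seen_bigrams, en_wv, zu_wv, topn=3):
    """Generate candidate switch bigrams by replacing each side of an
    observed bigram with its nearest monolingual embedding neighbours."""
    synthetic = set()
    for zu_word, en_word in seen_bigrams:
        # Replace the English side with distributionally similar English words.
        if en_word in en_wv:
            for neighbour, _ in en_wv.most_similar(en_word, topn=topn):
                synthetic.add((zu_word, neighbour))
        # Replace the isiZulu side with distributionally similar isiZulu words.
        if zu_word in zu_wv:
            for neighbour, _ in zu_wv.most_similar(zu_word, topn=topn):
                synthetic.add((neighbour, en_word))
    # Keep only bigrams not already observed in the transcriptions.
    return synthetic - set(seen_bigrams)


new_bigrams = synthesise_bigrams(seen_switch_bigrams, en_model.wv, zu_model.wv)
print(new_bigrams)  # candidate bigrams to add to the LM training counts
```

In practice such synthetic bigrams would be filtered (e.g. by embedding similarity or a validity check) before being added to the language model training counts; the sketch omits that step for brevity.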