Abstract

We report on the development of two reference corpora for the analysis of Sepedi-English code-switched speech in the context of automatic speech recognition. For the first corpus, possible English events were obtained from an existing corpus of transcribed Sepedi-English speech. The second corpus is based on the analysis of radio broadcasts: actual instances of code switching were transcribed and reproduced by a number of native Sepedi speakers. We describe the process to develop and verify both corpora and perform an initial analysis of the newly produced data sets. We find that, in naturally occurring speech, the frequency of code switching is unexpectedly high for this language pair, and that the continuum of code switching (from unmodified embedded words to loanwords absorbed into the matrix language) makes this a particularly challenging task for speech recognition systems.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call