Abstract

With the growth of Automatic Speech Recognition (ASR) and voice user interface software, it is important to test for efficacy across different language varieties and identify sources of bias. Recent work assessing ASR efficacy and bias implicates factors such as race, gender, dialect, and age in differing efficacy rates. Multilingualism presents another source of variation that ASR systems must grapple with, ranging from code-switching to phonetic variation both within and across speakers. Thus, variable ASR performance is likely exacerbated for multilingual communities. Using a spontaneous Cantonese-English bilingual speech corpus (Johnson, 2021), this study tests the efficacy of Google Speech-to-Text (STT) language models (Canadian English and Hong Kong Cantonese) with a heterogeneous bilingual speech community. Efficacy is assessed via fuzzy string matching between the STT output and the manually corrected transcripts. STT performance is variable but overall better for English. Confidence ratings and matching scores are evaluated alongside listener ratings of perceived accentedness and various demographic groupings. The results of this study will provide practical guidance for using STT in the context of speech production research pipelines while highlighting its drawbacks concerning bias in a relatively understudied, multilingual group.
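As context for the evaluation method, fuzzy string matching between an STT transcript and a corrected reference can be approximated as follows. This is a minimal sketch using a character-level similarity ratio from Python's standard difflib; the abstract does not specify the paper's actual matching implementation, and the example transcripts below are invented for illustration.

```python
# Minimal sketch of transcript similarity scoring; not the paper's
# actual fuzzy-matching implementation, which is unspecified here.
from difflib import SequenceMatcher

def match_score(stt_transcript: str, corrected_transcript: str) -> float:
    """Character-level similarity ratio in [0, 1] between two transcripts."""
    return SequenceMatcher(None, stt_transcript, corrected_transcript).ratio()

# Hypothetical STT output vs. a manually corrected transcript.
stt_out = "I went to the market yesterday"
reference = "I went to the wet market yesterday"
print(f"match score: {match_score(stt_out, reference):.3f}")
```

A score near 1.0 indicates close agreement between the STT output and the corrected transcript; in a research pipeline, such scores could then be compared across languages and demographic groupings as the abstract describes.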
