Organizations, researchers, and software increasingly use automatic speech recognition (ASR) to transcribe speech to text. However, ASR can be less accurate for (i.e., biased against) certain demographic subgroups. This is concerning, given that the machine-learning (ML) models used to automatically score video interviews use ASR transcriptions of interviewee responses as inputs. To address these concerns, we investigate the extent of ASR bias and its effects in automatically scored interviews. Specifically, we compare the accuracy of ASR transcription for English as a second language (ESL) versus non-ESL interviewees, people of color (and Black interviewees separately) versus White interviewees, and male versus female interviewees. Then, we test whether ASR bias causes bias in ML model scores, both in terms of differential convergent correlations (i.e., subgroup differences in correlations between observed and ML scores) and differential means (i.e., shifts in subgroup differences from observed to ML scores). To do so, we apply one human and four ASR transcription methods to two samples of mock video interviews (Ns = 1,014 and 414), and then we train and test models using these different transcripts to score multiple constructs. We observed significant bias in the commercial ASR services across nearly all comparisons, with the magnitude of bias differing across the ASR services. However, the transcription bias did not translate into meaningful measurement bias for the ML interview scores, whether in terms of differential convergent correlations or means. We discuss what these results mean for the nature of bias, fairness, and validity of ML models for scoring verbal open-ended responses.
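To make the three kinds of checks described above concrete, the following is a minimal, hypothetical sketch in Python (not the authors' analysis code): subgroup differences in ASR word error rate (WER), a comparison of convergent correlations between observed and ML scores across subgroups, and a comparison of standardized subgroup mean differences from observed to ML scores. All data, variable names, and the particular significance tests (Welch's t, Fisher r-to-z, Cohen's d) are illustrative assumptions, not taken from the study.

```python
# Hypothetical sketch of ASR-bias and measurement-bias checks.
# All data below are synthetic placeholders.
import numpy as np
from scipy import stats

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = int(ref[i - 1] != hyp[j - 1])
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution
    return d[-1, -1] / max(len(ref), 1)

# Example: one human reference transcript vs. one ASR hypothesis.
example_wer = wer("i enjoy working in teams", "i enjoy work in the teams")

rng = np.random.default_rng(0)

# 1) Transcription bias: mean WER difference between two subgroups (e.g., ESL vs. non-ESL).
wer_a = rng.normal(0.15, 0.05, 100)   # placeholder per-response WERs, subgroup A
wer_b = rng.normal(0.22, 0.05, 100)   # placeholder per-response WERs, subgroup B
t_wer, p_wer = stats.ttest_ind(wer_a, wer_b, equal_var=False)

# 2) Differential convergent correlations: observed (human-rated) vs. ML scores
#    within each subgroup, compared with a Fisher r-to-z test.
obs_a, ml_a = rng.normal(size=(2, 100))
obs_b, ml_b = rng.normal(size=(2, 100))
r_a, _ = stats.pearsonr(obs_a, ml_a)
r_b, _ = stats.pearsonr(obs_b, ml_b)
se = np.sqrt(1 / (len(obs_a) - 3) + 1 / (len(obs_b) - 3))
z_diff = (np.arctanh(r_a) - np.arctanh(r_b)) / se
p_corr = 2 * (1 - stats.norm.cdf(abs(z_diff)))

# 3) Differential means: does the subgroup gap shift from observed to ML scores?
def cohens_d(x, y):
    # Pooled-SD standardized mean difference (equal group sizes assumed here).
    pooled = np.sqrt((np.var(x, ddof=1) + np.var(y, ddof=1)) / 2)
    return (np.mean(x) - np.mean(y)) / pooled

d_observed, d_ml = cohens_d(obs_a, obs_b), cohens_d(ml_a, ml_b)
print(f"example WER={example_wer:.2f}; WER gap p={p_wer:.3f}; "
      f"r_A={r_a:.2f} vs r_B={r_b:.2f} (p={p_corr:.3f}); "
      f"d_observed={d_observed:.2f} vs d_ML={d_ml:.2f}")
```

Under this sketch, transcription bias would show up as a reliable WER gap in step 1, while measurement bias in the ML scores would show up as subgroup differences in the convergent correlations (step 2) or as a shift in the standardized subgroup mean difference from observed to ML scores (step 3).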