Abstract

Background
Scoring verbal cognitive tests with automatic speech recognition (ASR) engines increases scoring efficiency and provides word timestamps that enable detailed temporal analyses of spoken responses. Here, we describe consensus ASR (CASR) procedures that incorporate multiple ASR engines to increase transcription and timing accuracy and to generate CASR transcript confidence scores.

Method
Seven ASR engines produced automatic transcriptions of both speech database samples (GMU Speech Accent Archive and NUS Auditory English Lexicon Project) and the verbal test responses of 41 subjects from the California Cognitive Assessment Battery (CCAB). A novel Recognizer Output Voting Error Reduction (ROVER) algorithm mutually aligned the transcripts, and a Bayesian weighted voting algorithm produced the best CASR transcript, mean word timestamps, and consensus scores. Word error rates (WERs) gauged CASR accuracy against either predetermined or manually corrected transcripts.

Result
Mean database sentence WERs from 1,767 subjects ranged from 22% (Windows 10 UWP) to 6% (Rev.ai), with CASR producing 5%; there were no significant gender or age effects, but performance was better for native English speakers (Figure 1). In CCAB test responses, CASR WERs ranged from 3% to less than 1% for limited word response tests (Figure 2), from 8% to 2% for expansive word response tests (Figure 3), and from 6% to 5% for discursive speech. ASR word start time estimates for 594 database words in lists deviated from true times by standard deviations ranging from 250 ms (Google) to 17 ms (Amazon), with CASR obtaining 14 ms errors (Figure 4).
Finally, consensus confidence scores from CCAB test responses, ranging from 0 to 1 (1 = complete agreement across ASR engines), show that CASR words with consensus scores above 0.8 and 0.9 are correct more than 99% and 99.8% of the time, respectively (Figure 5).

Conclusion
CASR produces transcripts of verbal test responses accurate enough to estimate scores on most limited word response tests. In large vocabulary response tests, CASR transcripts facilitate quick manual correction, and confidence values can identify transcript words needing manual correction. Patterns in CASR errors also indicate that substantial future reductions in CASR WER are possible on a per-test basis.
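The consensus step described above can be illustrated with a minimal sketch. The paper's actual method uses ROVER alignment followed by Bayesian weighted voting; the sketch below substitutes simple unweighted majority voting over already-aligned word slots, purely to show how a consensus word, a mean timestamp, and a 0-to-1 confidence score (fraction of engines agreeing) could be derived. All function and variable names here are hypothetical, not from the paper.

```python
from collections import Counter
from statistics import mean

def consensus_vote(aligned_slots):
    """For each aligned word slot, pick the majority word across engines,
    average the start timestamps of the agreeing engines, and compute a
    consensus confidence as the fraction of engines that agree.

    aligned_slots: list of slots; each slot is a list of
    (word, start_time) pairs, one per ASR engine. The alignment itself
    (the ROVER step) is assumed to have been done already.
    """
    result = []
    for slot in aligned_slots:
        counts = Counter(word for word, _ in slot)
        best_word, votes = counts.most_common(1)[0]
        confidence = votes / len(slot)
        start = mean(t for word, t in slot if word == best_word)
        result.append((best_word, start, confidence))
    return result

# Three hypothetical engines transcribing the same two-word span:
slots = [
    [("hello", 0.10), ("hello", 0.12), ("hello", 0.11)],
    [("world", 0.52), ("word", 0.55), ("world", 0.50)],
]
for word, start, conf in consensus_vote(slots):
    print(f"{word}\t{start:.2f}\t{conf:.2f}")
```

In this toy example the second slot yields "world" with confidence 2/3, mirroring how lower consensus scores flag words that may need manual correction.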
