The Benefit Obtained from Visually Displayed Text from an Automatic Speech Recognizer During Listening to Speech Presented in Noise

Adriana A Zekveld, Marcel S M G Vlaming, Sophia E Kramer, Tammo Houtgast, Judith M Kessens

doi:10.1097/aud.0b013e31818005bd

Journal: Ear & Hearing
Publication Date: Dec 1, 2008
Citations: 21

Affiliation: Amsterdam UMC Location VUmc

Abstract

The aim of this study was to evaluate the benefit that listeners obtain from visually presented output of an automatic speech recognition (ASR) system while listening to speech in noise.

Auditory-alone and audiovisual speech reception thresholds (SRTs) were measured. The SRT is defined as the speech-to-noise ratio at which 50% of the test sentences are reproduced correctly. In the auditory-alone SRT test, the test sentences were presented only auditorily; in the audiovisual SRT test, the ASR output of each test sentence was also presented as text. The ASR system was used in two recognition modes: recognition of spoken words (word output) or recognition of speech sounds, or phones (phone output). The benefit obtained from the ASR output was defined as the difference between the auditory-alone and the audiovisual SRT. We also examined the readability of unimodally displayed ASR output (i.e., the percentage of sentences in which ASR errors were identified and accurately corrected). In experiment 1, the readability of and benefit obtained from ASR word output (n = 14) were compared with those obtained from ASR phone output (n = 10). In experiment 2, the effect of presenting an indication of the ASR confidence level was examined (n = 14). In experiment 3, the effect of delaying the presentation of the text relative to the speech (by up to 6 sec) was examined (n = 24). The ASR accuracy level was varied systematically in each experiment.

Mean readability scores ranged from 0 to 46%, depending on ASR accuracy. Speech comprehension improved when the ASR output was displayed. For example, when the ASR output corresponded to readability scores of only about 20% correct, the text improved the SRT by about 3 dB SNR in the audiovisual SRT test. This improvement corresponds to an increase in speech comprehension of about 35% in critical conditions. Equally readable phone and word output provided similar benefit in speech comprehension. For equal ASR accuracies, both the readability of and the benefit from the word output generally exceeded those of the phone output. Presenting information about the ASR confidence level did not influence either the readability of or the benefit obtained from the word output. Delaying the text relative to the speech moderately decreased the benefit.

The present study indicates that speech comprehension in noise improves considerably when textual ASR output of moderate accuracy is displayed, and that this improvement depends on the readability of the ASR output. Word output has better accuracy and readability than phone output. Listeners are therefore better able to use ASR word output than phone output to improve speech comprehension. The ability of older listeners and listeners with hearing impairments to use ASR output in speech comprehension requires further study.
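
The benefit measure defined in the abstract (auditory-alone SRT minus audiovisual SRT) can be illustrated with a minimal Python sketch. The function name and the SRT values below are hypothetical and chosen only for illustration; they are not data from the study. Because a lower SRT means the listener tolerates more noise, a positive difference indicates that the displayed text helped.

```python
from statistics import mean


def asr_benefit_db(auditory_alone_srt_db: float, audiovisual_srt_db: float) -> float:
    """Benefit from ASR text: auditory-alone SRT minus audiovisual SRT (dB SNR).

    A lower SRT is better, so a positive result means the text improved
    speech reception.
    """
    return auditory_alone_srt_db - audiovisual_srt_db


# Hypothetical (auditory-alone, audiovisual) SRTs in dB SNR for three listeners.
# These numbers are illustrative assumptions, not results from the study.
hypothetical_srts = [(-4.0, -7.0), (-3.5, -6.0), (-5.0, -8.5)]

benefits = [asr_benefit_db(a, av) for a, av in hypothetical_srts]
mean_benefit = mean(benefits)

print(benefits)       # per-listener benefit in dB SNR
print(mean_benefit)   # group mean benefit in dB SNR
```

With these illustrative values the group mean benefit is 3 dB SNR, the same order of magnitude as the improvement the abstract reports for ASR output with readability scores of about 20%.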