Impact of Accuracy and Latency on Mean Opinion Scores for Speech Recognition Solutions

James Scovell, Marco Beltman, Rina Doherty, Rania Elnaggar, Chaitanya Sreerama

doi:10.1016/j.promfg.2015.07.434

Open Access

Journal: Procedia Manufacturing
Publication Date: Jan 1, 2015
Citations: 4
License type: CC BY-NC-ND

Affiliation: Mission College, Intel (United States)

Abstract

Speech recognition is no longer a technology of the future and is now broadly adopted in many products. Some solutions use low-power, always-on keyword-spotting techniques to wake up the device before engaging the large-vocabulary continuous speech recognition engine. This staged approach decreases power consumption and increases noise robustness. This paper presents a study that tested the effects of accuracy and latency on subjective ratings for a keyword-triggered speech solution. A specialized software framework was developed in which the accuracy and latency of tasks were systematically controlled to understand the impact on user experience. A user interface, based on existing industry solutions, was developed to simulate realistic use case scenarios. A within-subjects design was employed, and data were collected from a total of 47 participants. Participants were asked to rate their experience on a five-point scale (5 = excellent, 4 = good, 3 = fair, 2 = poor, 1 = bad). The experimental design followed an ITU MOS methodology for subjective assessment. There were three different latencies and four different accuracy levels, for a total of 12 combinations. A two-way repeated-measures analysis of variance indicated that the mean subjective ratings were a strong function of accuracy and a weak function of latency. The relationship between mean perceptual ratings, accuracy, and latency was uncovered. With minimal degradation to accuracy, participants had a high tolerance for latency, with average experience ratings in the ‘good’ range even for latencies up to 4 seconds. Participants had a low tolerance when accuracy dropped to 70% or below, with average experience ratings below the ‘good’ range. In addition, thresholds were established for the upper acceptability bound of latency using the time until a command was repeated. These metrics and methods provide key insights for setting user-centric design targets and informing architectural optimizations.
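
The 3 x 4 design and the per-condition mean opinion score described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' software framework. The latency and accuracy levels, the function names, and the sample ratings are all hypothetical placeholders, since the abstract reports only the number of levels per factor and the five-point scale.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

# Hypothetical factor levels: the abstract gives only the counts
# (three latencies, four accuracy levels), not the actual values.
LATENCIES_S = (1.0, 2.0, 4.0)
ACCURACIES_PCT = (70, 80, 90, 100)


def conditions():
    """Return every latency/accuracy pair of the full factorial design."""
    return list(product(LATENCIES_S, ACCURACIES_PCT))


def mean_opinion_scores(ratings):
    """Average the 1-5 ratings for each (latency, accuracy) condition.

    `ratings` is an iterable of (participant_id, latency_s, accuracy_pct, score).
    """
    by_condition = defaultdict(list)
    for _participant, latency, accuracy, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        by_condition[(latency, accuracy)].append(score)
    return {cond: mean(scores) for cond, scores in by_condition.items()}


# Placeholder ratings, made up for illustration only (not data from the study).
example_ratings = [
    ("p1", 1.0, 100, 5),
    ("p2", 1.0, 100, 4),
    ("p1", 4.0, 70, 2),
    ("p2", 4.0, 70, 3),
]
mos = mean_opinion_scores(example_ratings)
```

In a within-subjects design each participant would contribute one rating to every one of the 12 conditions, so a real input would hold 47 x 12 rows. The repeated-measures analysis of variance itself is outside the scope of this sketch.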

Full Text