Abstract
The applications of automatic speech recognition (ASR) systems are proliferating, in part due to recent significant quality improvements. However, as recent work indicates, even state-of-the-art speech recognition systems, some of which deliver impressive benchmark results, struggle to generalize across use cases. We review relevant work and, hoping to inform future benchmark development, outline a taxonomy of speech recognition use cases, proposed for the next generation of ASR benchmarks. We also survey work on metrics beyond the de facto standard Word Error Rate (WER), and we introduce a versatile framework designed to describe interactions between linguistic variation and ASR performance metrics.
Highlights
The applications of automatic speech recognition (ASR) systems are many and varied: conversational virtual assistants on smartphones and smart-home devices, automatic captioning for videos, text dictation, and phone chatbots for customer support, to name a few
Inspired by Goodhart’s law, which states that when a measure becomes a target, it ceases to be a good measure, we argue that, as a field, it behooves us to think more about better benchmarks in order to gain a well-rounded view of the performance of ASR systems across domains
In the previous sections we have argued that a single aggregate statistic like the average word error rate (WER) can be too coarse-grained to describe accuracy in a real-world deployment that targets multiple sociolinguistic slices of the population
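To make the point concrete, here is a minimal sketch of how an aggregate WER can mask large per-slice differences. The utterances, slice labels, and helper names below are hypothetical, invented purely for illustration; WER is computed in the standard way, as the word-level edit distance divided by the number of reference words.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # Single-row dynamic-programming edit distance over word sequences.
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                          # deletion
                       d[j - 1] + 1,                      # insertion
                       prev + (r[i - 1] != h[j - 1]))     # substitution / match
            prev = cur
    return d[len(h)] / len(r)

def pooled_wer(rows):
    """Word-weighted WER over (slice, reference, hypothesis) rows."""
    errors = sum(word_error_rate(r, h) * len(r.split()) for _, r, h in rows)
    words = sum(len(r.split()) for _, r, _ in rows)
    return errors / words

# Hypothetical test utterances tagged with a sociolinguistic slice.
samples = [
    ("slice_a", "turn on the lights", "turn on the lights"),
    ("slice_a", "play some music",    "play some music"),
    ("slice_b", "turn on the lights", "turn on the light"),
    ("slice_b", "play some music",    "lay sum music"),
]

print(f"overall WER: {pooled_wer(samples):.2f}")   # 0.21
for name in ("slice_a", "slice_b"):
    rows = [s for s in samples if s[0] == name]
    print(f"{name} WER: {pooled_wer(rows):.2f}")   # 0.00 vs 0.43
```

The aggregate figure of 0.21 looks respectable, yet one slice experiences a WER twice that number while the other experiences none, which is exactly the disparity a single pooled statistic hides.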
Summary
The applications of ASR systems are many and varied: conversational virtual assistants on smartphones and smart-home devices, automatic captioning for videos, text dictation, and phone chatbots for customer support, to name a few. This proliferation has been enabled by significant gains in ASR quality. Achieving state-of-the-art results on one of these benchmark datasets does not necessarily mean that an ASR system will generalize successfully when faced with input from a wide range of domains at inference time: as Likhomanenko et al. (2020) show, “no single validation or test set from public datasets is sufficient to measure transfer to other public datasets or to real-world audio data”. FAIR recently released the Casual Conversations dataset intended for AI fairness measurements (Hazirbas et al., 2021)