Abstract

The applications of automatic speech recognition (ASR) systems are proliferating, in part due to recent significant quality improvements. However, as recent work indicates, even state-of-the-art speech recognition systems, some of which deliver impressive benchmark results, struggle to generalize across use cases. We review relevant work and, hoping to inform future benchmark development, outline a taxonomy of speech recognition use cases proposed for the next generation of ASR benchmarks. We also survey work on metrics beyond the de facto standard Word Error Rate (WER), and we introduce a versatile framework designed to describe interactions between linguistic variation and ASR performance metrics.
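
For reference, the WER metric discussed throughout is conventionally defined as WER = (S + D + I) / N, where S, D, and I are the counts of word substitutions, deletions, and insertions in the hypothesis relative to a reference transcript, and N is the number of words in the reference. (This is the standard definition; it is not restated in the summary below.)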

Highlights

  • The applications of automatic speech recognition (ASR) systems are many and varied: conversational virtual assistants on smartphones and smart-home devices, automatic captioning for videos, text dictation, and phone chat bots for customer support, to name a few

  • Inspired by Goodhart’s law, which states that any measure that becomes a target ceases to be a good measure, we argue that, as a field, it behooves us to invest more thought in designing better benchmarks in order to gain a well-rounded view of the performance of ASR systems across domains

  • In the previous sections we have argued that a single aggregate statistic like the average word error rate (WER) can be too coarse-grained to describe accuracy in a real-world deployment that targets multiple sociolinguistic slices of the population (a per-slice computation is sketched below)
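
To make the slicing idea concrete, here is a minimal, self-contained sketch of per-slice WER and a population-weighted aggregate. The slice names, utterance pairs, and weights are hypothetical, invented for illustration; they are not from the paper, and a real deployment would derive them from demographic metadata and population estimates.

```python
# Minimal sketch: per-slice WER plus a population-weighted aggregate.
# All slice names, data, and weights below are hypothetical.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub_cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# (reference, hypothesis) pairs grouped by sociolinguistic slice (toy data).
slices = {
    "dialect_a": [("turn the lights on", "turn the light on")],
    "dialect_b": [("call my sister", "call my mister"),
                  ("play some jazz", "play some jazz")],
}
weights = {"dialect_a": 0.7, "dialect_b": 0.3}  # hypothetical population shares

per_slice = {name: sum(wer(r, h) for r, h in pairs) / len(pairs)
             for name, pairs in slices.items()}
weighted = sum(weights[name] * w for name, w in per_slice.items())

for name, w in per_slice.items():
    print(f"{name}: WER = {w:.2f}")
print(f"population-weighted WER = {weighted:.2f}")
```

Note that this sketch macro-averages WER within each slice (a mean of per-utterance rates); pooling edit counts over all reference words in a slice is an equally common choice, and the two can diverge when utterance lengths vary.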

Introduction

The applications of ASR systems are many and varied: conversational virtual assistants on smartphones and smart-home devices, automatic captioning for videos, text dictation, and phone chat bots for customer support, to name a few. This proliferation has been enabled by significant gains in ASR quality, typically measured on standard benchmark datasets. However, achieving state-of-the-art results on one of these benchmarks does not necessarily mean that an ASR system will generalize successfully when faced with input from a wide range of domains at inference time: as Likhomanenko et al. (2020) show, “no single validation or test set from public datasets is sufficient to measure transfer to other public datasets or to real-world audio data”. Relatedly, FAIR recently released the Casual Conversations dataset, intended for AI fairness measurements (Hazirbas et al., 2021).

Contents of the full text

  • ASR Use Cases
  • Horizontals
  • Verticals
  • Practical Issues
  • Metrics
  • Metadata about Words
  • Real-Time Factor
  • Streaming ASR
  • Inference and Training
  • Contextual Biasing
  • Hallucination
  • Debuggability and Fixability
  • Demographically Informed Quality
  • Population-Weighted Slicing Framework
  • Defining slices
  • Findings
  • Conclusion