Abstract
The applications of automatic speech recognition (ASR) systems are proliferating, in part due to recent significant quality improvements. However, as recent work indicates, even state-of-the-art speech recognition systems, some of which deliver impressive benchmark results, struggle to generalize across use cases. We review relevant work and, hoping to inform future benchmark development, outline a taxonomy of speech recognition use cases, proposed for the next generation of ASR benchmarks. We also survey work on metrics beyond the de facto standard Word Error Rate (WER), and we introduce a versatile framework designed to describe interactions between linguistic variation and ASR performance metrics.
Highlights
The applications of automatic speech recognition (ASR) systems are many and varied: conversational virtual assistants on smartphones and smart-home devices, automatic captioning for videos, text dictation, and phone chatbots for customer support, to name a few
Inspired by Goodhart’s law, which states that when a measure becomes a target, it ceases to be a good measure, we argue that, as a field, it behooves us to think more about better benchmarks in order to gain a well-rounded view of the performance of ASR systems across domains
In the previous sections we have argued that a single aggregate statistic like the average word error rate (WER) can be too coarse-grained to describe accuracy in a real-world deployment that targets multiple sociolinguistic slices of the population
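To make the point concrete, here is a minimal sketch of how an aggregate WER can mask large per-slice differences. The utterances, slice labels, and helper names below are hypothetical, invented purely for illustration; WER is computed in the standard way, as the word-level edit distance divided by the number of reference words.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # Single-row dynamic-programming edit distance over word sequences.
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                          # deletion
                       d[j - 1] + 1,                      # insertion
                       prev + (r[i - 1] != h[j - 1]))     # substitution / match
            prev = cur
    return d[len(h)] / len(r)

def pooled_wer(rows):
    """Word-weighted WER over (slice, reference, hypothesis) rows."""
    errors = sum(word_error_rate(r, h) * len(r.split()) for _, r, h in rows)
    words = sum(len(r.split()) for _, r, _ in rows)
    return errors / words

# Hypothetical test utterances tagged with a sociolinguistic slice.
samples = [
    ("slice_a", "turn on the lights", "turn on the lights"),
    ("slice_a", "play some music",    "play some music"),
    ("slice_b", "turn on the lights", "turn on the light"),
    ("slice_b", "play some music",    "lay sum music"),
]

print(f"overall WER: {pooled_wer(samples):.2f}")   # 0.21
for name in ("slice_a", "slice_b"):
    rows = [s for s in samples if s[0] == name]
    print(f"{name} WER: {pooled_wer(rows):.2f}")   # 0.00 vs 0.43
```

The aggregate figure of 0.21 looks respectable, yet one slice experiences a WER twice that number while the other experiences none, which is exactly the disparity a single pooled statistic hides.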
Summary
The applications of ASR systems are many and varied: conversational virtual assistants on smartphones and smart-home devices, automatic captioning for videos, text dictation, and phone chatbots for customer support, to name a few. This proliferation has been enabled by significant gains in ASR quality. Achieving state-of-the-art results on one of these benchmark datasets does not necessarily mean that an ASR system will generalize successfully when faced with input from a wide range of domains at inference time: as Likhomanenko et al. (2020) show, “no single validation or test set from public datasets is sufficient to measure transfer to other public datasets or to real-world audio data”. FAIR recently released the Casual Conversations dataset intended for AI fairness measurements (Hazirbas et al., 2021)