Abstract

Many studies of speech perception assess the intelligibility of spoken sentence stimuli by means of transcription tasks (‘type out what you hear’). The intelligibility of a given stimulus is then often expressed as the percentage of words correctly reported from the target sentence. Yet scoring the participants’ raw responses for words correctly identified from the target sentence is time-consuming and hence resource-intensive. Moreover, there is no consensus among speech scientists about which specific protocol to use for human scoring, limiting the reliability of human scores. The present paper evaluates various forms of fuzzy string matching between participants’ responses and target sentences as automated metrics of listener transcript accuracy. We demonstrate that one particular metric, the token sort ratio, is a consistent, highly efficient, and accurate metric for automated assessment of listener transcripts, as evidenced by high correlations with human-generated scores (best correlation: r = 0.940) and a strong relationship to acoustic markers of speech intelligibility. Thus, fuzzy string matching provides a practical tool for assessing listener transcript accuracy in large-scale speech intelligibility studies. See https://tokensortratio.netlify.app for an online implementation.
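As a concrete illustration, the token sort ratio can be computed with the open-source rapidfuzz Python library. This is a minimal sketch under our own assumptions: the library choice and the example sentences are ours, not the paper's, and the authors provide their own implementation at the URL above.

```python
# Minimal sketch of token-sort-ratio scoring, assuming the open-source
# rapidfuzz library (pip install rapidfuzz); the sentences are
# illustrative, not taken from the paper's stimuli.
from rapidfuzz import fuzz

target = "the boy gave the ball to the dog"
response = "the boy gave the dog the ball"  # word order differs

# token_sort_ratio alphabetically sorts the tokens of both strings before
# comparing them, so reorderings of correctly reported words are not
# penalized; scores range from 0 (no overlap) to 100 (identical).
score = fuzz.token_sort_ratio(target, response)
print(score)  # high score despite the different word order
```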

Highlights

  • Because all automated metrics evaluated here are deterministic mathematical functions, they are perfectly consistent: the same response always receives the same score

  • They are extremely efficient compared to generating human percentage words correct (PWC) scores: on a computer with a 1.8 GHz Intel Core i7 processor and 16 GB of RAM, calculating all scores for all data sets in one go in R (LS and J) and Python (TSR) took under 30 seconds (see the batch-scoring sketch below)
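For reference, all three metrics can be batch-computed in a single pass. The sketch below uses Python throughout (via rapidfuzz) rather than the paper's mix of R and Python, and it assumes that LS and J denote Levenshtein and Jaro similarity; the data are placeholders.

```python
# Sketch of batch scoring, assuming rapidfuzz and assuming LS = Levenshtein
# similarity and J = Jaro similarity; unlike the paper, which computed LS
# and J in R, this computes all three metrics in Python.
from rapidfuzz import fuzz
from rapidfuzz.distance import Jaro, Levenshtein

# Hypothetical (response, target) pairs standing in for a real data set.
pairs = [
    ("the boy gave the dog the ball", "the boy gave the ball to the dog"),
    ("a quick brown focks", "a quick brown fox"),
]

for response, target in pairs:
    ls = Levenshtein.normalized_similarity(response, target)  # LS, 0..1
    j = Jaro.similarity(response, target)                     # J, 0..1
    tsr = fuzz.token_sort_ratio(response, target)             # TSR, 0..100
    print(f"LS={ls:.3f}  J={j:.3f}  TSR={tsr:.1f}")
```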


Introduction

Many studies of speech perception are concerned with the processing of speech in adverse listening conditions. In such studies, intelligibility is commonly assessed with transcription tasks, scored by human raters for the percentage of words correct (PWC). Human scorers are flexible and can score the raw responses while taking into account potential semantic, syntactic, or orthographic constraints motivated by the particular study design (e.g., counting non-literal responses containing synonyms, conjugated forms, or obvious spelling errors as correct). This flexibility also presents concerns: there is, for instance, no consensus on what protocol to follow when scoring PWC. As a result, a given response may receive different PWC scores from different human scorers, or even from the same scorer on different occasions. Another drawback of manually scoring PWC is that it is time-consuming and resource-intensive.
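To make the contrast concrete, the hypothetical comparison below shows how a strict exact-word PWC count penalizes obvious spelling errors that a fuzzy metric such as the token sort ratio largely absorbs. This is a sketch under our own assumptions, using rapidfuzz; the strict scoring rule is illustrative and is not the paper's protocol.

```python
# Hypothetical contrast between a strict exact-word PWC count and the
# token sort ratio, assuming rapidfuzz; not the paper's scoring protocol.
from rapidfuzz import fuzz

target = "the children played outside in the garden"
response = "the childen played outside in the gardne"  # two typos

# Strict PWC: proportion of target tokens reported verbatim.
target_tokens = target.split()
response_tokens = response.split()
pwc = 100 * sum(t in response_tokens for t in target_tokens) / len(target_tokens)

tsr = fuzz.token_sort_ratio(target, response)

print(f"strict PWC = {pwc:.0f}%")  # misspelled words count as fully wrong
print(f"TSR        = {tsr:.0f}")   # typos cost only a few points
```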

