Abstract

Verbal responses are a convenient and naturalistic way for participants to provide data in psychological experiments (Salzinger, 1959). However, audio recordings of verbal responses typically require additional processing, such as transcribing the recordings into text, as compared with other behavioral response modalities (e.g., typed responses, button presses, etc.). Further, the transcription process is often tedious and time-intensive, requiring human listeners to manually examine each moment of recorded speech. Here we evaluate the performance of a state-of-the-art speech recognition algorithm (Halpern et al., 2016) in transcribing audio data into text during a list-learning experiment. We compare transcripts made by human annotators to the computer-generated transcripts. Both sets of transcripts matched to a high degree and exhibited similar statistical properties, in terms of the participants' recall performance and recall dynamics that the transcripts captured. This proof-of-concept study suggests that speech-to-text engines could provide a cheap, reliable, and rapid means of automatically transcribing speech data in psychological experiments. Further, our findings open the door for verbal response experiments that scale to thousands of participants (e.g., administered online), as well as a new generation of experiments that decode speech on the fly and adapt experimental parameters based on participants' prior responses.

Highlights

  • Speech-to-text engines became popular in the 1990s (Kurzweil et al., 1990) when the performance of speech recognition algorithms (primarily based on Hidden Markov Models; Rabiner, 1989) reached sufficient levels to provide plausible, though still often inaccurate, transcripts (Bamberg et al., 1990).

  • We sought to evaluate the transcription accuracy of a modern speech-to-text engine applied to recordings of verbal responses from a list-learning experiment.

  • The two sets of transcripts matched well, indicating that the verbal responses transcribed by the speech-to-text engine were an accurate reflection of what the participants said.
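As an illustrative sketch (not the authors' actual scoring code), one simple way to quantify agreement between a human-annotated transcript and a speech-to-text transcript of the same recall period is to treat each as a set of recalled words and compute a hit rate and false alarm rate against the experiment's vocabulary. The function name and data below are hypothetical:

```python
# Hypothetical sketch: score an ASR transcript against a human transcript,
# treating the human annotation as ground truth for which words were said.

def transcript_match(human_words, asr_words, vocabulary):
    """Return (hit_rate, false_alarm_rate) for the ASR transcript."""
    human = {w.lower() for w in human_words}
    asr = {w.lower() for w in asr_words}
    vocab = {w.lower() for w in vocabulary}

    # Hits: vocabulary words both transcripts agree were recalled.
    hit_rate = len(human & asr) / len(human) if human else 1.0

    # False alarms: vocabulary words the ASR reported but the human
    # annotator did not, out of all words the participant did not say.
    absent = vocab - human
    false_alarms = (asr & vocab) - human
    false_alarm_rate = len(false_alarms) / len(absent) if absent else 0.0
    return hit_rate, false_alarm_rate

hr, far = transcript_match(
    ["dog", "apple", "river"],          # human transcript (toy data)
    ["dog", "apple", "table"],          # ASR transcript (toy data)
    ["dog", "apple", "river", "table", "chair"],  # study vocabulary
)
print(hr, far)  # hit rate ~0.67, false alarm rate 0.5
```

A set-based comparison like this ignores word order; analyses of recall dynamics would additionally need the order and timing of each recalled word.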


Introduction

Speech-to-text engines became popular in the 1990s (Kurzweil et al., 1990) when the performance of speech recognition algorithms (primarily based on Hidden Markov Models; Rabiner, 1989) reached sufficient levels to provide plausible, though still often inaccurate, transcripts (Bamberg et al., 1990). Speech decoding has the potential to save researchers an enormous amount of time when analyzing verbal response data, and to enable new experimental designs that adapt based on parameters derived from decoded speech data. Whatever their current limitations, as speech-to-text algorithms continue to mature, their utility in psychological research should improve as well.

