Transcripts are vital in any research involving conversation. Most transcription is conducted manually by experts, a process that can take many times longer than the conversation itself. Recently, there has been interest in using automatic speech recognition (ASR) to automate transcription, driven by the wide availability of ASR platforms such as OpenAI's Whisper. However, as studies typically focus on metrics such as the word error rate, there is a lack of detail about ASR transcript quality and the practicalities of ASR use in research. In this paper we review six state-of-the-art ASR technologies, three commercial and three open-source, and assess their capabilities as automatic transcription tools. We find that the commercial ASR systems mostly capture an accurate representation of what was said, and that overlapping speech is handled well. Unlike prior work, we show that commercial ASR also preserves the location, but not necessarily the spelling, of a large majority of non-lexical tokens: short words such as uh-hum which play vital roles in conversation. We show that the open-source ASR systems produce substantially more errors than their commercial counterparts. However, we highlight how the cost and privacy advantages of open-source ASR may outweigh performance issues in certain applications. We discuss practical considerations for ASR deployment in research, concluding that present ASR technology cannot yet replace the trained transcriber. However, a high-quality initial transcript generated by ASR can provide a good starting point and may be further refined by manual correction. We make all ASR-generated transcripts available for future research in the supplementary material.