Abstract

This study presents an automatic speech recognition (ASR) model designed to diagnose pronunciation issues in children with speech sound disorders (SSDs), with the aim of replacing manual transcription in clinical procedures. Because general-purpose ASR models map input speech to standard orthographic words, even well-known high-performance ASR models are not suitable for evaluating pronunciation in children with SSDs. We fine-tuned the wav2vec2.0 XLS-R model to recognise words as children actually pronounce them, rather than converting the speech into their standard spellings. The model was fine-tuned on a speech dataset of 137 children with SSDs pronouncing 73 Korean words selected for actual clinical diagnosis. The model's Phoneme Error Rate (PER) was only 10% when its predictions of children's pronunciations were compared to human annotations of the pronunciations as heard. In contrast, despite its robust performance on general tasks, the state-of-the-art ASR model Whisper showed clear limitations in recognising the speech of children with SSDs, with a PER of approximately 50%. While the model still requires improvement in recognising unclear pronunciation, this study demonstrates that ASR models can streamline complex pronunciation-error diagnostic procedures in clinical settings.
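The Phoneme Error Rate reported above is the standard edit-distance metric: the Levenshtein distance between the predicted and reference phoneme sequences, divided by the reference length. A minimal sketch of the computation, with illustrative phoneme sequences that are not from the paper's dataset:

```python
# Minimal sketch of Phoneme Error Rate (PER).
# PER = edit distance (substitutions + insertions + deletions) between
# the hypothesis and reference phoneme sequences, divided by the number
# of reference phonemes.

def levenshtein(ref, hyp):
    """Edit distance between two phoneme sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution
    return d[-1]

def per(ref, hyp):
    """Phoneme Error Rate: edit distance normalised by reference length."""
    return levenshtein(ref, hyp) / len(ref)

# Hypothetical example: a child realises /r/ as /w/ in "rabbit".
ref = ["r", "ae", "b", "ih", "t"]   # reference pronunciation
hyp = ["w", "ae", "b", "ih", "t"]   # pronunciation as heard
print(per(ref, hyp))  # 0.2 — one substitution out of five phonemes
```

In the paper's setting, the reference is the human annotation of the pronunciation as heard, so a low PER means the model transcribes the child's actual (possibly disordered) production, not the dictionary form.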
