Development and benchmarking of a Korean audio speech recognition model for Clinician-Patient conversations in radiation oncology clinics

Seok-Joo Chun, Jung Bin Park, Hyejo Ryu, Bum-Sup Jang

doi:10.1016/j.ijmedinf.2023.105112

Abstract

Background: The purpose of this study is to develop an audio speech recognition (ASR) deep learning model for transcribing clinician-patient conversations in radiation oncology clinics.

Methods: We fine-tuned the pre-trained English QuartzNet 15x5 model for the Korean language using a publicly available dataset of simulated situations between clinicians and patients. Real conversations between a radiation oncologist and 115 patients in actual clinics were then prospectively collected, transcribed, and divided into training (30.26 h) and testing (0.79 h) sets. These datasets were used to develop the ASR model for clinics, which was benchmarked against other ASR models: 'Whisper large', the 'Riva Citrinet-1024 Korean model', and the 'Riva Conformer Korean model'.

Results: The pre-trained English ASR model was successfully fine-tuned and converted to recognize the Korean language, resulting in a character error rate (CER) of 0.17. However, this performance was not sustained on the real conversation dataset. To address this, we further fine-tuned the model, which improved its CER on this dataset to 0.26. The other ASR models, 'Whisper large', the 'Riva Citrinet-1024 Korean model', and the 'Riva Conformer Korean model', showed CERs of 0.31, 0.28, and 0.25, respectively. On the general Korean conversation dataset 'zeroth-korean', our model showed a CER of 0.44, while 'Whisper large', the 'Riva Citrinet-1024 Korean model', and the 'Riva Conformer Korean model' showed CERs of 0.26, 0.98, and 0.99, respectively.

Conclusion: We developed a Korean ASR model to transcribe real conversations between a radiation oncologist and patients. Compared to other models, the performance of the model was deemed acceptable for both specific and general purposes. We anticipate that this model will reduce the time required for clinicians to document the patient's chief complaints or side effects.
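For readers unfamiliar with the metric, the CER reported above is conventionally the character-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference characters. The following is a minimal sketch of that conventional definition, not the evaluation code used in the study. The function names, the choice to ignore whitespace, and the NFC normalization that compares Korean text as whole syllable blocks are all assumptions made for illustration.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str, ignore_spaces: bool = True) -> float:
    """Character error rate: edit distance divided by reference length.

    Assumptions (not taken from the paper): text is NFC-normalized so each
    Hangul syllable block counts as one character, and whitespace is dropped
    when ignore_spaces is True.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if ignore_spaces:
        ref = "".join(ref.split())
        hyp = "".join(hyp.split())
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one wrong syllable out of five reference characters.
print(cer("방사선 치료", "방사선 치로"))
```

Because insertions are counted, a CER computed this way can exceed 1.0 when the hypothesis is much longer than the reference. Values for different models are comparable only when the same normalization and whitespace handling are applied to all of them.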

Full Text