This study examines how strongly enunciation affects the performance of a speech-to-text (STT) system, especially on medical terminology. To this end, an audio dataset of Polish medical terms and spoken diagnoses was recorded by healthcare professionals, including general practitioners and specialists in fields such as cardiology, pulmonology, and radiology. The recordings then underwent comprehensive acoustic and lexical analyses covering features such as harmonics-to-noise ratio, spectral tilt, zero-crossing rate, formant dispersion, jitter, and shimmer. A transformer-based automatic speech recognition (ASR) model was used to transcribe the recordings. Transcription quality was evaluated with several measures, including word error rate (WER), match error rate (MER), word information lost (WIL), word information preserved (WIP), and character error rate (CER). With the STT model's quality quantified, it was possible to analyze the correlation between the acoustic features and expression style, as well as speakers' distinctive vocabulary choices when reading acronyms. [Work supported by the Polish National Center for Research and Development (NCBR) project: “ADMEDVOICE-Adaptive intelligent speech processing system of medical personnel with the structuring of test results and support of therapeutic process,” no. INFOSTRATEG4/0003/2022.]
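The transcription measures named above can be illustrated with a minimal Python sketch. It assumes the commonly used definitions based on hits (H), substitutions (S), deletions (D), and insertions (I) taken from one minimum-edit-distance alignment; the abstract does not state which tool or exact definitions the authors used, so the function names `align_counts` and `stt_metrics` and the example hypothesis string are hypothetical. Because several optimal alignments can exist, the hit count (and therefore MER, WIP, and WIL) may differ slightly between implementations.

```python
def align_counts(ref, hyp):
    """Count (hits, substitutions, deletions, insertions) along one
    minimum-edit-distance alignment of two token sequences."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner, preferring the diagonal move.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            if ref[i - 1] == hyp[j - 1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def stt_metrics(reference, hypothesis):
    """Return WER, MER, WIL, WIP, and CER for one utterance.
    Assumes a non-empty reference and whitespace tokenization."""
    h, s, d, i = align_counts(reference.split(), hypothesis.split())
    n_ref = h + s + d          # number of reference words
    n_hyp = h + s + i          # number of hypothesis words
    errors = s + d + i
    wip = (h / n_ref) * (h / n_hyp) if n_hyp else 0.0

    # CER applies the WER formula to characters (spaces included here).
    ch, cs, cd, ci = align_counts(list(reference), list(hypothesis))
    return {
        "wer": errors / n_ref,
        "mer": errors / (h + errors),
        "wip": wip,
        "wil": 1.0 - wip,
        "cer": (cs + cd + ci) / (ch + cs + cd),
    }


# Illustrative pair only: the hypothesis is invented, not taken from the dataset.
scores = stt_metrics("zawał mięśnia sercowego", "zawał mięśnia sercowy")
print(scores)
```

In this made-up example one of three reference words is substituted, so WER and MER both equal 1/3, while CER is lower because most characters of the misrecognized word still match.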