Comparing the performance of classic voice-driven assistive systems for dysarthric speech

Wei-Zhong Zheng, Ji-Yan Han, Hsiu-Lien Cheng, Wei-Chung Chu, Ko-Chiang Chen, Ying-Hui Lai

doi:10.1016/j.bspc.2022.104447

Abstract

Voice-driven communication assistive systems, namely speech enhancement (SE), voice conversion (VC), and automatic speech recognition with text-to-speech (ASR-TTS), are recognized approaches for improving the speech intelligibility of dysarthric speakers. However, which approach performs best for patients with moderate dysarthria remains unclear. This study compared the benefits of three classic voice-driven assistive systems of different types for dysarthric patients under identical test conditions. Fourteen patients with mild-to-severe dysarthria and five speakers with normal speech were invited to record the training sets for these systems. Five patients with moderate dysarthria were selected to record two additional testing sets, which were used to evaluate the systems’ benefits. Evaluation metrics from Google Automatic Speech Recognition (Google ASR) and listening tests were used to assess each system’s speech intelligibility and quality. The speech intelligibility results produced by Google ASR were 7.0%, 22.9%, and 93.8% for the SE, VC, and ASR-TTS systems, respectively. In the listening test, speech intelligibility was 38.7%, 40.5%, and 95.5%, and speech quality scores were 1.81, 2.18, and 4.56, for the SE, VC, and ASR-TTS systems, respectively. The ASR-TTS system thus performed better than SE and VC on every measure. Furthermore, t-distributed stochastic neighbor embedding (t-SNE) analysis was used to further compare the systems. The t-SNE results indicated that the phonetic posteriorgram features of the ASR-TTS system provided stable performance compared with the speech features (log-power spectrum and spectra) used in the SE and VC systems. These results suggest that ASR-TTS is a promising system for improving the speech intelligibility and quality of patients with moderate dysarthria in future applications.

Full Text