Abstract
Most speech recognition models currently in use are trained on the speech of healthy speakers. As a result, recognition accuracy for patients with depression or Parkinson's disease (PD), whose speech characteristics differ from those of healthy speakers, is lower than for healthy speakers. This study explores a model that improves speech recognition accuracy for individuals with depression or PD, aiming to provide them with more accurate services. Considering the speech features of these patients, we designed the model on the assumption that understanding the overall meaning and context of an utterance through global information, rather than local information, is more effective for improving recognition accuracy. We propose the m-Globalformer, a model based on the Globalformer architecture, which combines a squeeze-and-excitation (SE) module with the Transformer; the m-Globalformer enhances the use of global information by modifying the base SE module. Because patient speech data are limited, the model employs a pre-training and fine-tuning strategy: it is first trained on a large-scale corpus of normal speech and then fine-tuned on a small-scale dataset of speech from patients with depression or PD. In our experiments, the m-Globalformer demonstrated superior performance, achieving character error rates (CERs) of 11.28% for depression and 19.67% for PD.
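As context for the SE module mentioned above, the following is a minimal sketch of a standard squeeze-and-excitation block applied to a (channels, time) speech feature map. It is not the paper's modified module: the weights here are random purely for illustration (in practice they are learned), and the function name and shapes are assumptions for this sketch.

```python
import numpy as np

def se_block(x: np.ndarray, reduction: int = 4,
             rng: np.random.Generator = np.random.default_rng(0)) -> np.ndarray:
    """Standard squeeze-and-excitation over a (channels, time) feature map.

    Weights are random here for illustration only; in a real model they
    are learned parameters.
    """
    c, _ = x.shape
    w1 = rng.standard_normal((c, c // reduction))   # bottleneck projection
    w2 = rng.standard_normal((c // reduction, c))   # expansion projection
    s = x.mean(axis=1)                    # squeeze: global average over time
    h = np.maximum(s @ w1, 0.0)           # bottleneck + ReLU
    g = 1.0 / (1.0 + np.exp(-(h @ w2)))   # sigmoid gates in (0, 1)
    return x * g[:, None]                 # recalibrate each channel

feats = np.random.default_rng(1).standard_normal((8, 50))  # 8 channels, 50 frames
out = se_block(feats)
print(out.shape)  # (8, 50)
```

The "squeeze" step pools each channel over the whole utterance, which is one way such a block injects global information into otherwise local convolutional features.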