Abstract

In this paper, we present an end-to-end speech recognition system for Japanese persons with articulation disorders resulting from athetoid cerebral palsy. Because their utterances are often unstable or unclear, speech recognition systems struggle to recognize their speech. Recent deep learning-based approaches have exhibited promising performance; however, they require a large amount of training data, and it is difficult to collect sufficient data from people with dysarthria. This paper proposes a transfer learning method that transfers two types of knowledge corresponding to two different datasets: the language-dependent (phonetic and linguistic) characteristics of unimpaired speech and the language-independent characteristics of dysarthric speech. The former is obtained from Japanese non-dysarthric speech data, and the latter from non-Japanese dysarthric speech data. In the proposed method, we pre-train a model using Japanese non-dysarthric speech and non-Japanese dysarthric speech, and thereafter fine-tune it using the target Japanese dysarthric speech. To handle the speech data of the two languages in one model, we employ language-specific decoder modules. Experimental results indicate that the proposed approach significantly improves speech recognition performance compared with other approaches that do not use additional speech data.
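
The two-stage scheme described above can be pictured concretely. The following is a minimal sketch assuming a PyTorch implementation; the module layout, the per-frame projection decoders, and the "ja"/"en" language labels are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class MultiDecoderASR(nn.Module):
    """Shared encoder with language-specific decoders (hypothetical sketch)."""
    def __init__(self, vocab_sizes, n_mels=80, hidden=256):
        super().__init__()
        # Shared encoder: learns language-independent acoustics, including
        # the characteristics of dysarthric speech.
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        # One decoder per language (simplified here to a per-frame projection)
        # for the language-dependent, phonetic/linguistic knowledge.
        self.decoders = nn.ModuleDict(
            {lang: nn.Linear(hidden, size) for lang, size in vocab_sizes.items()}
        )

    def forward(self, feats, lang):
        enc, _ = self.encoder(feats)      # (batch, time, hidden)
        return self.decoders[lang](enc)   # per-frame logits for `lang`

model = MultiDecoderASR({"ja": 3000, "en": 1000})
optim = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stage 1 (pre-training): alternate batches of Japanese unimpaired speech
# ("ja") and non-Japanese dysarthric speech ("en"), so the shared encoder
# sees both kinds of knowledge.
# Stage 2 (fine-tuning): continue training the encoder and the "ja" decoder
# on the small target set of Japanese dysarthric speech.
```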

Highlights

  • In this study, we focused on the problem of speech recognition for persons with articulation disorders caused by the athetoid type of cerebral palsy

  • In this paper, we proposed a novel knowledge transfer approach for dysarthric speech recognition that uses speech data obtained both from physically unimpaired persons and from persons with dysarthria speaking in a different language

  • We used additional speech data obtained from physically unimpaired persons speaking in a different language

Summary

INTRODUCTION

We focus on the problem of speech recognition for persons with articulation disorders caused by the athetoid type of cerebral palsy. Vachhani et al. [14] proposed a feature enhancement method based on an autoencoder: the autoencoder is trained on speech obtained from physically unimpaired control speakers and is then used to convert dysarthric speech into an improved feature representation. For multilingual speech recognition tasks, Toshniwal et al. [25] jointly trained a single ASR model on a dataset covering nine Indian languages, and this approach showed improvements over monolingual models. The LAS model [28] consists of a listener module and a speller module that are trained jointly. The goal of this model is to estimate the probability of a grapheme sequence from the previous graphemes and a sequence of acoustic features, i.e., P(y | x) = ∏_i P(y_i | x, y_<i). Training maximizes the expected log-likelihood E_{(x,y)∼D}[log P(y | x)], where D denotes the joint distribution over input sequence x and label sequence y. Brief sketches of the autoencoder-based enhancement and the LAS factorization follow.
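
First, a minimal sketch of autoencoder-based feature enhancement along the lines of Vachhani et al. [14], assuming a PyTorch implementation; the network sizes, function names, and training loop are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class FeatureAutoencoder(nn.Module):
    """Trained to reconstruct acoustic features of unimpaired speech."""
    def __init__(self, n_feats=39, bottleneck=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_feats, 64), nn.Tanh(),
                                 nn.Linear(64, bottleneck), nn.Tanh())
        self.dec = nn.Sequential(nn.Linear(bottleneck, 64), nn.Tanh(),
                                 nn.Linear(64, n_feats))

    def forward(self, x):
        return self.dec(self.enc(x))

ae = FeatureAutoencoder()
optim = torch.optim.Adam(ae.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_on_unimpaired(frames):        # frames: (N, n_feats) from control speakers
    optim.zero_grad()
    loss = loss_fn(ae(frames), frames)  # plain reconstruction objective
    loss.backward()
    optim.step()

def enhance(dysarthric_frames):
    # At recognition time, dysarthric features are passed through the
    # autoencoder, mapping them toward the learned "unimpaired" feature space.
    with torch.no_grad():
        return ae(dysarthric_frames)
```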

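Second, a minimal sketch of the LAS factorization, again assuming PyTorch; the dot-product attention and single-layer modules are simplifications of the actual listener/speller architecture, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLAS(nn.Module):
    def __init__(self, n_mels=80, hidden=128, vocab=32):
        super().__init__()
        self.listener = nn.LSTM(n_mels, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.speller = nn.LSTMCell(2 * hidden, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats, graphemes):
        enc, _ = self.listener(feats)                   # (B, T, H)
        B = feats.size(0)
        h = enc.new_zeros(B, self.speller.hidden_size)
        c = enc.new_zeros(B, self.speller.hidden_size)
        log_probs = []
        for i in range(graphemes.size(1)):
            # Dot-product attention over the listener states.
            att = torch.softmax(torch.bmm(enc, h.unsqueeze(2)).squeeze(2), dim=1)
            ctx = torch.bmm(att.unsqueeze(1), enc).squeeze(1)  # (B, H)
            inp = torch.cat([self.embed(graphemes[:, i]), ctx], dim=1)
            h, c = self.speller(inp, (h, c))
            # Distribution for the next grapheme: p(y_{i+1} | x, y_{<=i}).
            log_probs.append(F.log_softmax(self.out(h), dim=1))
        return torch.stack(log_probs, dim=1)            # (B, L, vocab)
```

Summing the returned log-probabilities at the reference graphemes yields log P(y | x), whose expectation over D is the quantity maximized during training.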
