Abstract

Dysarthria is a motor speech disorder that impairs the articulatory system, hindering a speaker's ability to communicate. A speech recognition-based augmentative and alternative communication aid is an attractive option for addressing these communication difficulties. However, successful development of an automatic speech recognition (ASR)-based aid depends on the availability of sufficient speech data for training. Building an ASR system for dysarthric speakers is difficult owing to the limited amount of training data and the large inter- and intra-speaker variability. Using speech data from normal speakers for data augmentation or adaptation is extremely challenging for dysarthric speakers with low intelligibility because of the wide variation in acoustic characteristics between the two categories of speakers. In the current article, a two-level data augmentation is performed on dysarthric speech, based on virtual linear microphone array-based synthesis followed by multi-resolution feature extraction. With the augmented speech data, an isolated-word hybrid DNN-HMM ASR system is trained on the UA-Speech corpus and a Tamil dysarthric speech corpus developed by the authors. The resulting ASR system achieves WER reductions of up to 32.79% and 35.75% for low and very low intelligibility speakers with dysarthria, respectively, compared with recent data augmentation approaches reported for dysarthric speech recognition.
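To make the virtual array-based synthesis step concrete, below is a minimal sketch of one plausible realisation: a single-channel utterance is "captured" by a virtual linear microphone array by applying the far-field inter-microphone delays, producing several augmented copies of each recording. The function name, array geometry, and parameter values are illustrative assumptions on my part; the paper's actual synthesis procedure may differ.

```python
import numpy as np

def virtual_array_augment(signal, fs, n_mics=4, spacing=0.05,
                          angle_deg=30.0, c=343.0):
    """Simulate capture of a single-channel signal by a virtual linear
    microphone array: each virtual mic receives a delayed copy of the
    source, yielding extra training variants of the utterance.

    All parameters (mic count, spacing, source angle) are illustrative
    assumptions, not the authors' exact configuration.
    """
    theta = np.deg2rad(angle_deg)            # assumed source direction of arrival
    copies = []
    for m in range(n_mics):
        # far-field time delay of mic m relative to mic 0
        tau = m * spacing * np.sin(theta) / c
        shift = int(round(tau * fs))          # delay in whole samples
        delayed = np.concatenate([np.zeros(shift), signal])[:len(signal)]
        copies.append(delayed)
    return copies                             # n_mics augmented variants

# usage: 1 s of dummy speech at 16 kHz -> 4 delayed variants
fs = 16000
x = np.random.randn(fs).astype(np.float32)
variants = virtual_array_augment(x, fs)
```

In practice, each variant would then be passed through the multi-resolution feature extraction stage before DNN-HMM training, multiplying the effective amount of dysarthric training data.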
