Speech Vision: An End-to-End Deep Learning-Based Dysarthric Automatic Speech Recognition System

Seyed Reza Shahamiri

doi:10.1109/tnsre.2021.3076778

Abstract

Dysarthria is a disorder that affects an individual's speech intelligibility due to the paralysis of muscles and organs involved in the articulation process. As the condition is often associated with physically debilitating disabilities, such individuals not only face communication problems, but interactions with digital devices can also become a burden. For these individuals, automatic speech recognition (ASR) technologies can make a significant difference in their lives, as computing and portable digital devices can become an interaction medium that enables them to communicate with other people and with computers. However, ASR technologies have performed poorly in recognizing dysarthric speech, especially for severe dysarthria, due to multiple challenges facing dysarthric ASR systems. We identified that these challenges stem from the alternation and inaccuracy of dysarthric phonemes, the scarcity of dysarthric speech data, and the imprecision of phoneme labeling. This paper reports on our second dysarthric-specific ASR system, called Speech Vision (SV), which tackles these challenges by adopting a novel approach to dysarthric ASR in which speech features are extracted visually and SV then learns to see the shape of the words pronounced by dysarthric individuals. This visual acoustic modeling feature of SV eliminates phoneme-related challenges. To address the data scarcity problem, SV adopts visual data augmentation techniques, generates synthetic dysarthric acoustic visuals, and leverages transfer learning. Benchmarked against the other state-of-the-art dysarthric ASR systems considered in this study, SV outperformed them, improving recognition accuracy for 67% of UA-Speech speakers, with the biggest improvements achieved for severe dysarthria.
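
The abstract does not say how SV renders speech as an image, so the following is only an illustrative sketch of the general idea of extracting speech features visually. It assumes that a log-magnitude spectrogram, scaled to grayscale pixel values, is one plausible kind of acoustic visual; this is an assumption, not the representation reported in the paper. The function name `spectrogram_image`, the frame length, and the hop size are hypothetical choices made for the example.

```python
import cmath
import math


def spectrogram_image(samples, frame_len=64, hop=32):
    """Turn a 1-D waveform into a grayscale image (rows = frequency bins,
    columns = time frames, values 0-255). Illustrative only; the frame
    length and hop size are arbitrary assumptions, not values from the paper."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    columns = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT keeps the sketch dependency-free; a practical system
        # would use an FFT routine instead.
        column = []
        for k in range(n_bins):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            column.append(20 * math.log10(abs(acc) + 1e-10))  # magnitude in dB
        columns.append(column)
    if not columns:
        raise ValueError("signal is shorter than one frame")
    # Min-max normalise the dB values to the 0-255 grayscale range.
    lo = min(min(c) for c in columns)
    hi = max(max(c) for c in columns)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    # Transpose so that each row is one frequency bin.
    return [[round((c[k] - lo) * scale) for c in columns]
            for k in range(n_bins)]


# Synthetic stand-in for a speech recording: a 1 kHz tone sampled at 8 kHz.
# With 64-sample frames, 1 kHz falls on bin 1000 * 64 / 8000 = 8.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
image = spectrogram_image(tone)
```

An image of this kind can be passed to an image classifier, and image-style augmentation can be applied to it, which is the sense in which the sketch relates to the visual modeling and visual data augmentation mentioned in the abstract.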

Highlights

Dysarthria is a neurological motor speech disorder characterized by an individual’s loss of control of their motor subsystems [1]
We developed a dysarthric multi-networks speech recognizer (DM-NSR) based on a realization of multi-views multilearners (MVML) using an array of artificial neural networks (ANNs) capable of improving the tolerance of dysarthric speech
We identified three challenges in developing dysarthric automatic speech recognition (ASR) systems and proposed a system called Speech Vision that attempts to address them

Summary

Introduction

Dysarthria is a neurological motor speech disorder characterized by an individual’s loss of control of their motor subsystems [1]. Dysarthria often accompanies neurological conditions; many people with dysarthria are physically debilitated, which means that interfacing with digital devices and computers via mouse, keyboard, and touchscreen may be challenging or impossible. For such individuals, automatic speech recognition (ASR) technologies can be a desirable alternative that enables them to interface with digital devices or that serves as a communication intermediary [4]; ASR technologies can significantly improve the quality of life of dysarthric individuals via their applications in Augmentative/Alternative Communication (AAC) tools.

Methods

Results

Conclusion