Abstract

Building automatic speech recognition (ASR) systems for speakers with dysarthria is a very challenging task. Although multi-modal ASR has received increasing attention recently, incorporating real articulatory data alongside acoustic features has not been widely explored in the dysarthric speech community. This paper investigates the effectiveness of multi-modal acoustic modelling for dysarthric speech recognition using acoustic features together with articulatory information. The proposed multi-stream architectures consist of convolutional, recurrent and fully-connected layers, allowing for bespoke per-stream pre-processing, fusion at the optimal level of abstraction, and post-processing. We study the optimal fusion level/scheme as well as training dynamics in terms of cross-entropy and word error rate (WER) using the popular TORGO dysarthric speech database. Experimental results show that fusing the acoustic and articulatory features at the empirically determined optimal level of abstraction achieves a substantial performance gain, yielding up to 4.6% absolute (9.6% relative) WER reduction for speakers with dysarthria.
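As a rough illustration of the multi-stream, mid-level fusion scheme the abstract describes, the Python (PyTorch) sketch below builds a two-stream network with per-stream convolutional pre-processing, concatenation-based fusion feeding a shared recurrent layer, and fully-connected post-processing. This is a minimal sketch under stated assumptions, not the authors' exact model: the feature dimensions (e.g. 40-dimensional acoustic features, 12-dimensional articulatory trajectories such as EMA), the layer sizes, the fusion point, and the use of concatenation as the fusion scheme are all illustrative choices.

import torch
import torch.nn as nn

class TwoStreamFusionModel(nn.Module):
    """Sketch of a two-stream acoustic model with mid-level fusion.

    Dimensions and the fusion point are assumptions, not the paper's
    reported configuration.
    """

    def __init__(self, acoustic_dim=40, artic_dim=12, hidden=256, n_targets=2000):
        super().__init__()
        # Per-stream pre-processing: 1-D convolutions over the time axis.
        self.acoustic_conv = nn.Sequential(
            nn.Conv1d(acoustic_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.artic_conv = nn.Sequential(
            nn.Conv1d(artic_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Fusion by concatenation, then a shared bidirectional recurrent layer.
        self.rnn = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        # Post-processing: fully-connected layers to frame-level targets.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_targets),
        )

    def forward(self, acoustic, artic):
        # acoustic: (batch, time, acoustic_dim); artic: (batch, time, artic_dim)
        # Conv1d expects (batch, channels, time), hence the transposes.
        a = self.acoustic_conv(acoustic.transpose(1, 2)).transpose(1, 2)
        b = self.artic_conv(artic.transpose(1, 2)).transpose(1, 2)
        fused = torch.cat([a, b], dim=-1)  # mid-level fusion point
        out, _ = self.rnn(fused)
        # Per-frame logits for cross-entropy training against frame targets.
        return self.classifier(out)

# Usage sketch: one batch of 8 utterances, 200 frames each.
model = TwoStreamFusionModel()
logits = model(torch.randn(8, 200, 40), torch.randn(8, 200, 12))  # (8, 200, 2000)

In this family of architectures the fusion point can be moved earlier (concatenating raw features at the input) or later (after the recurrent layers); the abstract reports that an empirically chosen intermediate level of abstraction gave the best results.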
