Abstract
Amyotrophic lateral sclerosis (ALS) patients experience progressive speech deterioration due to muscle paralysis, leading to eventual loss of verbal communication capability. Text-to-speech synthesis (TTS) is an important technology for speech generating devices, enabling users to communicate using generic electronic voices, but often without the vocal identity of the users. Our work is aimed at personalizing TTS voices for people with ALS induced dysarthria by integrating machine learning and speech processing techniques of voice conversion (VC) and TTS. This is challenging as only small quantities of dysarthric speech are available from individual patients. Our system includes both timbre and prosody conversion for VC, neural TTS to generate TTS speech, and neural feature converter to interface VC and TTS. We collected speech data from 4 ALS target speakers with mild to severe dysarthria. Subjective listening tests showed that on average, our approach improved speech intelligibility by about 72% over the target speakers’ speech, the converted voice was 2 to 3 times more similar to ALS targets than to TTS sources, and the converted speech quality was in the MOS scale of fair to good.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.