Abstract
People who are blind increasingly use synthesized speech as a primary output modality when interacting with computers, mobile devices, and web-based services. They typically prefer to listen to synthesized speech at several times real-time speed, and consequently develop strong preferences for particular text-to-speech (TTS) engines and voices. These usage-based preferences lead to potentially incorrect assumptions about which TTS approaches are “best” for fast speech. We report on a cross-system comparison of the intelligibility of fast synthesized speech for users who have been blind from birth. We used male and female voices from multiple TTS engines representing the main approaches to TTS: diphone synthesis, unit selection synthesis, HMM-based synthesis, and formant synthesis. We recruited participants from organizations that work with the blind. Each participant listened to and transcribed semantically underspecified sentences from a single TTS engine and voice, spoken at speeds ranging from 300 to 550 words/min. The independent variables were TTS engine, the algorithm used to speed up the synthesized speech, and speaker sex; the primary dependent variable was transcription accuracy. We report differences in intelligibility by TTS approach, speed-up method, TTS engine, and speaker sex.