Abstract

This study investigates the perception of coarticulatory vowel nasality generated using different text-to-speech (TTS) methods in American English. Experiment 1 compared concatenative and neural TTS using a 4IAX task, in which listeners discriminated between a word pair containing either two oral or two nasalized vowels and a word pair containing one oral and one nasalized vowel. Vowels occurred in either identical or alternating consonant contexts across pairs to reveal perceptual sensitivity and compensatory behavior, respectively. In identical contexts, listeners discriminated oral from nasalized vowels better in neural than in concatenative TTS on nasalized same-vowel trials, but better in concatenative TTS on oral same-vowel trials. Meanwhile, listeners displayed less compensation for coarticulation in neural than in concatenative TTS. To determine whether the apparent roboticity of the TTS voice shapes vowel discrimination and compensation patterns, a "roboticized" version of neural TTS was generated (by monotonizing f0 and adding an echo) while holding phonetic nasality constant; a ratings study (Experiment 2) confirmed that the manipulation altered apparent roboticity. Experiment 3 compared discrimination of unmodified and roboticized neural TTS: listeners displayed lower accuracy in identical contexts for roboticized relative to unmodified neural TTS, yet performance in alternating contexts was similar.

