This study explores the production and perception of word-final devoicing in German across text-to-speech (from technology used in common voice-AI “smart” speaker devices—specifically, voices from Apple and Amazon) and naturally produced utterances. First, the phonetic realization of word-final devoicing in German across text-to-speech (TTS) and naturally produced word productions was compared. Acoustic analyses reveal that the presence of cues to a word-final voicing contrast varied across speech types. Naturally produced words with phonologically voiced codas contain partial voicing, as well as longer vowels than words with voiceless codas. However, these distinctions are not present in TTS speech. Next, German listeners completed a forced-choice identification task, in which they heard the words and made coda consonant categorizations, in order to examine the intelligibility consequences of the word-final devoicing patterns across speech types. Intended coda identifications are higher for the naturally produced productions than for TTS. Moreover, listeners systematically misidentified voiced codas as voiceless in TTS words. Overall, this study extends previous literature on speech intelligibility at the intersection of speech synthesis and contrast neutralization. TTS voices tend to neutralize salient phonetic cues present in natural speech. Subsequently, listeners are less able to identify phonological distinctions in TTS. We also discuss how investigating which cues are more salient in natural speech can be beneficial in synthetic speech generation to make them more natural and also easier to perceive.
Read full abstract