Abstract

While speech adaptations directed at human listeners are comparatively well studied, less is known about how humans adjust their speech for intelligibility when interacting with an AI-powered voice interface. In this study, we investigate human speech adaptations in human-to-human versus human-to-AI unscripted conversations. Specifically, we examine the production of words containing intervocalic /t-d/ in a conversation between a speaker who distinguishes these two stops (e.g., metal–medal) and a speaker (“flapper”) who merges the two stops into a flap /ɾ/. We predict that misperceptions of intervocalic /t-d/ may cause confusion, thus motivating adaptations. We record native Canadian English speakers (flappers) while they play a video game on Zoom in two conversation settings: with (1) a human non-flapper and (2) an AI non-flapper (computer-generated speech). Acoustic analyses of the flapper speakers’ productions include features specific to the stop-flap distinction as well as global features (e.g., overall duration). In both human- and AI-directed speech, we expect the flapper speakers to change flapped productions to stops to enhance intelligibility, particularly late in the conversation. Moreover, we expect differences between human- and AI-directed adaptations, with the former predominantly employing sound-specific features and the latter relying more on global hyperarticulation. Understanding these interlocutor-oriented adaptations may inform the technology behind human-computer interfaces.
