Abstract

How do listeners interpret speech input when exposure does not contain information about the talker’s category representations? Competing accounts of this process are rarely contrasted directly. We implement competing hypotheses within the same general computational framework (Bayesian inference). All models were trained on phonetic productions to predict perception, reflecting the hypothesis that listeners learn category representations from the speech input. We compare the models against two experiments on the perception of L1-US English stop voicing (N = 24 and 122). Both experiments used minimal pairs (e.g., tin/din) varying in the primary cue (voice onset time, VOT) while holding secondary cues (f0 and vowel duration) at values consistent with their expected correlations with VOT. VOT values occurred equally often and spanned the range observed in US English. We find that (1) models that integrated perceptual noise predicted categorization responses better than those that did not; (2) models with multiple cues performed better than those with VOT alone; (3) models trained on talker-normalized phonetic cues performed better than those trained on unnormalized cues; and, surprisingly, (4) models that also normalized the novel speech input during the experiment performed worse than those that did not. Findings (3) and (4) suggest that listeners’ long-term representations are based on talker-normalized cues but require *labelled* input, contrary to most normalization accounts.

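To make the modeling approach concrete, the following is a minimal sketch, not the authors’ implementation, of the kind of Bayesian ideal-observer categorization model the abstract contrasts: each voicing category is a Gaussian over phonetic cues estimated from production data, and perceptual noise is modeled by widening each category’s likelihood before computing the posterior. All cue statistics, noise variances, and the stimulus values below are hypothetical placeholders.

```python
# Minimal sketch of a Bayesian ideal-observer categorization model over
# phonetic cues (VOT, f0, vowel duration). Not the authors' implementation;
# all numeric values are hypothetical placeholders.
import numpy as np
from scipy.stats import multivariate_normal

def posterior_voiceless(x, mu_d, cov_d, mu_t, cov_t, noise_cov=None, prior_t=0.5):
    """P(/t/ | cues x) under Gaussian category likelihoods.

    If noise_cov is given, perceptual noise is modeled by adding it to each
    category's covariance (i.e., marginalizing over the noisy percept).
    """
    if noise_cov is not None:
        cov_d = cov_d + noise_cov
        cov_t = cov_t + noise_cov
    like_d = multivariate_normal.pdf(x, mean=mu_d, cov=cov_d)
    like_t = multivariate_normal.pdf(x, mean=mu_t, cov=cov_t)
    return prior_t * like_t / (prior_t * like_t + (1 - prior_t) * like_d)

# Hypothetical category statistics over [VOT (ms), f0 (Hz), vowel duration (ms)],
# standing in for means and covariances estimated from production data.
mu_d = np.array([10.0, 180.0, 200.0])   # voiced /d/
mu_t = np.array([70.0, 200.0, 180.0])   # voiceless /t/
cov = np.diag([15.0**2, 20.0**2, 25.0**2])
noise = np.diag([8.0**2, 10.0**2, 12.0**2])  # assumed perceptual noise

x = np.array([35.0, 190.0, 190.0])           # an ambiguous stimulus
print(posterior_voiceless(x, mu_d, cov, mu_t, cov, noise_cov=noise))
```

In this framing, the model variants described in the abstract correspond to restricting `x` to VOT alone versus using all three cues, including or omitting `noise_cov`, and estimating the category statistics from unnormalized versus talker-normalized cue values.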