Abstract

Talkers vary in the phonetic realization of their vowels. One influential hypothesis holds that listeners overcome this inter-talker variability through pre-linguistic auditory mechanisms that normalize the acoustic or phonetic cues that form the input to speech recognition. Dozens of competing normalization accounts exist, including both vowel-specific accounts (e.g., Lobanov, 1971; Nearey, 1978; Syrdal and Gopal, 1986) and general-purpose accounts applicable to any type of phonetic cue (McMurray and Jongman, 2011). We add to the cross-linguistic literature by comparing normalization accounts against a new database of Swedish, a language with a particularly dense vowel inventory of 21 vowels differing in quality and quantity. We train Bayesian ideal observers (IOs) on unnormalized or normalized vowel data under different assumptions about the relevant cues to vowel identity (F0-F3, vowel duration), and evaluate their performance in predicting the category intended by the talker. The results indicate that the best-performing normalization accounts centered and/or scaled formants by talker (e.g., Lobanov), replicating previous findings for other languages with less dense vowel spaces. The relative advantage of Lobanov normalization decreased when additional cues were included, indicating that simple centering relative to the talker's mean might be sufficient to achieve robust inter-talker perception (e.g., C-CuRE).
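The Lobanov account mentioned above can be illustrated with a minimal sketch: each formant is z-scored within talker, so that a talker's mean and standard deviation for that formant are factored out before categorization. The data layout below (a list of per-token dicts with `talker`, `F1`, and `F2` keys) is a hypothetical illustration, not the paper's actual database format.

```python
import statistics

def lobanov_normalize(tokens):
    """Lobanov (1971) normalization: z-score each formant within talker.

    `tokens` is a list of dicts with keys "talker", "F1", "F2" (in Hz);
    this minimal layout is assumed for illustration only.
    """
    # Group tokens by talker so statistics are computed per talker.
    by_talker = {}
    for tok in tokens:
        by_talker.setdefault(tok["talker"], []).append(tok)

    normalized = []
    for talker, toks in by_talker.items():
        for formant in ("F1", "F2"):
            vals = [t[formant] for t in toks]
            mu = statistics.mean(vals)
            sd = statistics.stdev(vals)
            # Center on the talker's mean and scale by their SD.
            for t in toks:
                t[formant + "_z"] = (t[formant] - mu) / sd
        normalized.extend(toks)
    return normalized

tokens = [
    {"talker": "A", "F1": 300, "F2": 2200},
    {"talker": "A", "F1": 700, "F2": 1000},
    {"talker": "A", "F1": 500, "F2": 1600},
]
result = lobanov_normalize(tokens)
# F1 z-scores for talker A: -1.0, 1.0, 0.0
```

The C-CuRE-style alternative suggested by the results would keep only the centering step (subtracting the talker's mean) and drop the division by the standard deviation.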
