Vowel normalization is a computation that accounts for differences in the absolute (physical or psychophysical) representations of qualitatively equivalent vowel productions. These differences arise from speaker properties such as body size, age, gender, and other socially interpreted categories grounded in natural variation in vocal tract size and shape. We present a virtual environment for vocal learning that provides the means to model the acquisition of vowel normalization, along with other aspects of vocal learning. The environment consists of models of caretaker agents representing five language communities—American English, Cantonese, Greek, Japanese, and Korean—derived from vowel category perception experiments (Munson et al., 2010; Plummer et al., 2013), and models of infant agents (Plummer, 2012, 2013) that “vocally interact” with their caretakers. Moreover, we develop a model of caretaker social and vocal signaling in response to infant vowel productions, and of an infant's internalization of these signals and the internal computations over them. More broadly, we model the acquisition of vowel normalization within a developmental framework encompassing a suite of vocal learning phenomena, including language-specific caretaker vocal exchanges, perceptual warping, and multisensory matching and narrowing.