Visual cues facilitate speech perception during face-to-face communication, particularly in noisy environments. These visually driven enhancements arise from both automatic lip-reading and attentional tuning to auditory-visual signals. In crowded settings such as a cocktail party, however, how do we bind the correct voice to the correct face so that visual cues can benefit speech perception? Previous research has emphasized that the spatial and temporal alignment of auditory and visual signals determines which voice is integrated with which speaking face. Here, we present a novel illusion demonstrating that when multiple faces and voices are presented, and the temporal and spatial information about which auditory-visual pairs should be integrated is ambiguous, the perceptual system relies on identity information extracted from each signal to determine the pairings. Data from three experiments demonstrate that expectations about an individual’s voice, based on their identity, can change where listeners perceive that voice to originate.