When we hear an emotional voice, does this alter how the brain perceives and evaluates a subsequent face? Here, we addressed this question by comparing event-related potentials evoked by angry, sad, and happy faces following vocal expressions that varied in form (speech-embedded emotions, non-linguistic vocalizations) and emotional relationship (congruent, incongruent). Participants judged whether face targets were true exemplars of emotion (facial affect decision). Prototypicality decisions were faster and more accurate for congruent vs. incongruent faces and for targets that displayed happiness. Principal component analysis identified vocal context effects on faces in three distinct temporal factors: a posterior P200 (150–250 ms), associated with evaluating face typicality; a slow frontal negativity (200–750 ms) evoked by angry faces, reflecting enhanced attention to threatening targets; and the Late Positive Potential (LPP, 450–1000 ms), reflecting sustained contextual evaluation of intrinsic face meaning (with independent LPP responses in posterior and prefrontal cortex). Incongruent faces and faces primed by speech (compared to vocalizations) tended to increase demands on face perception at the stages of structure-building (P200) and meaning integration (posterior LPP). The frontal LPP spatially overlapped with the earlier frontal negativity response; these components were functionally linked to expectancy-based processes directed towards the incoming face, governed by the form of the preceding vocal expression (especially for anger). Our results highlight differences in how vocalizations and speech-embedded emotion expressions modulate cortical operations for predicting (prefrontal) versus integrating (posterior) face meaning in light of contextual details.
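
As an illustration of the temporal principal component analysis mentioned above, the sketch below decomposes ERP epochs into temporal factors whose loading curves indicate when each factor contributes to the waveform (e.g., a factor peaking around 150–250 ms would correspond to a P200-like component). This is a minimal, hypothetical example, not the authors' pipeline: the array shapes, sampling rate, epoch window, number of retained components, and the omission of factor rotation are all assumptions introduced here for illustration.

```python
# Minimal sketch of a temporal PCA over ERP epochs (illustrative only; not the authors' pipeline).
# Assumed: `epochs` is shaped (n_observations, n_timepoints), each observation being one
# subject x electrode x condition average, baseline-corrected and sampled at 250 Hz
# over a -100 to 1000 ms epoch. All names and values here are hypothetical.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
sfreq = 250                                   # sampling rate in Hz (assumed)
times = np.arange(-0.1, 1.0, 1 / sfreq)       # epoch from -100 to 1000 ms
n_obs = 120                                   # e.g., subjects x electrodes x conditions (assumed)
epochs = rng.standard_normal((n_obs, times.size))  # placeholder data in place of real ERPs

# Temporal PCA: time points are the variables, observations are the cases.
pca = PCA(n_components=10)
scores = pca.fit_transform(epochs)            # factor scores per observation
loadings = pca.components_                    # temporal loadings (components x timepoints)

# Inspect when each of the first few temporal factors peaks and how much variance it explains.
for k, load in enumerate(loadings[:3]):
    peak_ms = times[np.argmax(np.abs(load))] * 1000
    print(f"Factor {k + 1}: peak loading at {peak_ms:.0f} ms, "
          f"{pca.explained_variance_ratio_[k]:.1%} variance explained")
```

In practice, ERP temporal PCA is often followed by a rotation step (e.g., promax or varimax) before factor scores are submitted to condition-wise statistics; that step is omitted here for brevity.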