People require multimodal emotional interactions to live in a social environment. Several studies using dynamic facial expressions and emotional voices have reported that multimodal emotional incongruency evokes an early sensory component of event-related potentials (ERPs), while others have found a late cognitive component. The integration mechanism of two different results remains unclear. We speculate that it is semantic analysis in a multimodal integration framework that evokes the late ERP component. An electrophysiological experiment was conducted using emotionally congruent or incongruent dynamic faces and natural voices to promote semantic analysis. To investigate the top-down modulation of the ERP component, attention was manipulated via two tasks that directed participants to attend to facial versus vocal expressions. Our results revealed interactions between facial and vocal emotional expressions, manifested as modulations of the auditory N400 ERP amplitudes but not N1 and P2 amplitudes, for incongruent emotional face-voice combinations only in the face-attentive task. A late occipital positive potential amplitude emerged only during the voice-attentive task. Overall, these findings support the idea that semantic analysis is a key factor in evoking the late cognitive component. The task effect for these ERPs suggests that top-down attention alters not only the amplitude of ERP but also the ERP component per se. Our results implicate a principle of emotional face-voice processing in the brain that may underlie complex audiovisual interactions in everyday communication.