Social communication draws on several cognitive functions such as perception, emotion recognition and attention. The association of audio-visual information is essential to the processing of species-specific communication signals. In this study, we use functional magnetic resonance imaging to identify the subcortical areas involved in the cross-modal association of visual and auditory information based on their common social meaning. We identified three subcortical regions involved in audio-visual processing of species-specific communicative signals: the dorsolateral amygdala, the claustrum and the pulvinar. These regions responded to visual stimuli, to auditory stimuli congruent with the visual context, and to audio-visual stimuli. However, none of them was significantly activated when the auditory stimuli were semantically incongruent with the visual context, thus showing an influence of visual context on auditory processing. For example, positive vocalizations (coos) activated the three subcortical regions when presented in the context of positive facial expressions (lipsmacks) but not when presented in the context of negative facial expressions (aggressive faces). In addition, the medial pulvinar and the amygdala showed multisensory integration, such that audio-visual stimuli resulted in activations that were significantly higher than the highest unimodal response. Finally, the pulvinar responded in a task-dependent manner, along a specific spatial sensory gradient. We propose that the dorsolateral amygdala, the claustrum and the pulvinar belong to a multisensory network that modulates the perception of visual socioemotional information and vocalizations as a function of the relevance of the stimuli in the social context. SIGNIFICANCE STATEMENT: Understanding and correctly associating socioemotional information across sensory modalities, such that happy faces predict laughter and escape scenes predict screams, is essential when living in complex social groups. 
Using functional magnetic resonance imaging in the awake macaque, we identify three subcortical structures (dorsolateral amygdala, claustrum and pulvinar) that respond only to auditory information that matches the ongoing visual socioemotional context, such as hearing positively valenced coo calls while seeing positively valenced scenes of monkeys grooming each other. We additionally describe task-dependent activations in the pulvinar, organized along a specific spatial sensory gradient, supporting its role as a network regulator.