Efficient multisensory integration is often influenced by other cognitive processes including, but not limited to, semantic congruency and focused endogenous attention. Semantic congruency can re-allocate processing resources to the location of a congruent stimulus, while attention can prioritize the integration of multisensory stimuli under focus. Here, we explore the robustness of these effects in the context of three stimuli, two of which are in the focus of endogenous attention. Participants completed an endogenous attention task with a stimulus compound consisting of three different objects: (1) a visual object (V) in the foreground, (2) an auditory object (A), and (3) a visual background scene object (B). Three groups of participants focused their attention on either the visual object and the sound (group VA, n = 30), the visual object and the background (group VB, n = 27), or the sound and the background (group AB, n = 30), and judged the semantic congruency of the objects under focus. Congruency varied systematically across all three stimuli: all stimuli could be semantically incongruent (e.g., V, ambulance; A, church bell; and B, swimming pool), all could be congruent (e.g., V, lion; A, roar; and B, savannah), or two objects could be congruent with the remaining one incongruent to the other two (e.g., V, duck; A, quack; and B, phone booth). Participants exhibited a distinct pattern of errors: when they attended to two congruent objects (e.g., group VA: V, lion; A, roar) in the presence of an unattended, incongruent third object (e.g., B, bathroom), they tended to make more errors than in any other stimulus combination. Drift diffusion modeling of the behavioral data revealed a significantly smaller drift rate in the two-congruent-attended condition, indicating slower evidence accumulation, which was likely due to interference from the unattended, incongruent object. 
Interference with evidence accumulation occurred independently of which pair of objects was in the focus of attention, which suggests that the vulnerability of congruency judgments to incongruent unattended distractors does not depend on sensory modality. A control analysis ruled out the simple explanation of a negative response bias. These findings imply that our perceptual system is highly sensitive to semantic incongruencies even when they are not endogenously attended.
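
To illustrate why a smaller drift rate corresponds to slower evidence accumulation and more errors, the following is a minimal simulation sketch of a generic two-boundary drift diffusion process. It is not the model fitted in this study: the function name `simulate_ddm`, the parameter values (boundary separation, noise, time step, drift rates), and the omission of non-decision time are all assumptions chosen for illustration only.

```python
import math
import random


def simulate_ddm(drift, boundary=1.5, noise=1.0, dt=0.005, n_trials=1000, seed=0):
    """Simulate a simple two-boundary drift diffusion process.

    Evidence starts midway between 0 and `boundary` (no starting bias).
    Reaching `boundary` counts as a correct response, reaching 0 as an error.
    All parameter values are illustrative assumptions, not fitted estimates.
    Returns (proportion correct, mean decision time in seconds).
    """
    rng = random.Random(seed)
    step_sd = noise * math.sqrt(dt)  # standard deviation of one Euler step
    n_correct = 0
    total_time = 0.0
    for _ in range(n_trials):
        evidence = boundary / 2.0
        elapsed = 0.0
        # Accumulate noisy evidence until one of the two boundaries is crossed.
        while 0.0 < evidence < boundary:
            evidence += drift * dt + rng.gauss(0.0, step_sd)
            elapsed += dt
        if evidence >= boundary:
            n_correct += 1
        total_time += elapsed
    return n_correct / n_trials, total_time / n_trials


# Hypothetical drift rates: a higher one for an undisturbed condition and a
# lower one standing in for interference from an unattended distractor.
acc_high, rt_high = simulate_ddm(drift=2.0)
acc_low, rt_low = simulate_ddm(drift=0.5)
print(f"high drift: accuracy={acc_high:.2f}, mean decision time={rt_high:.2f} s")
print(f"low drift:  accuracy={acc_low:.2f}, mean decision time={rt_low:.2f} s")
```

Under these assumed settings, the lower drift rate should yield both lower accuracy and longer mean decision times, which is the qualitative signature that a reduced drift rate captures in behavioral data.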