Although evidence has shown that working memory (WM) can be differentially affected by the multisensory congruency of visual and auditory stimuli, it remains unclear whether the multisensory congruency of concrete and abstract words affects subsequent WM retrieval. By manipulating the focus of attention across different matching conditions of visual and auditory word characteristics in a 2-back paradigm, the present study revealed that, in the characteristically incongruent condition under auditory retrieval, responses to abstract words were faster than responses to concrete words, indicating that auditory abstract words are not affected by visual representation, whereas auditory concrete words are. In addition, for concrete words under visual retrieval, WM retrieval was faster in the characteristically incongruent condition than in the characteristically congruent condition, indicating that the visual representation formed by auditory concrete words may interfere with WM retrieval of visual concrete words. These findings suggest that concrete words in multisensory conditions may be encoded too aggressively together with other visual representations, which inadvertently slows WM retrieval. In contrast, abstract words appear to suppress interference more effectively and show better WM performance than concrete words in the multisensory condition.