In resource-limited keyword spotting scenarios, the scarcity of annotated corpora makes it difficult for deep learning models to learn robust acoustic representations. Recent studies have turned to contrastive learning, which derives self-supervision from paired examples, reflecting growing research interest in keyword spotting. Such models are typically trained on pre-segmented spoken words; however, the absence of explicit word boundaries in long audio poses a significant challenge. This study presents a novel approach that integrates Contrastive Language–Image Pre-Training with Cross-Attention for Self-Supervised Alignment in Multimodal Keyword Spotting. The proposed method introduces a cross-modal word-pair matching process that strengthens the interaction between audio and text embeddings and achieves word-level alignment across modalities. Using a self-supervised learning objective, the model learns semantic similarities and differences from a limited number of annotated audio–text pairs, yielding improved multimodal feature representations. These representations are then fed into the downstream keyword detection task, where a Bidirectional Cross-Attention block performs fine-grained coordination across modalities, bridging the semantic gap between heterogeneous audio and text representations and thereby improving keyword localization precision. Comprehensive experiments on the Aishell-2 and Librispeech datasets demonstrate that the proposed method markedly outperforms existing techniques, particularly on the overall F1 metric for keyword spotting. It attains state-of-the-art results on both clean and noisy speech, with respective improvements of 5.2% and 5.4% on Aishell-2 and 5.4% and 5.6% on Librispeech.
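
To make the two core ideas concrete, the following is a minimal sketch, not the authors' implementation, of (1) a CLIP-style symmetric contrastive loss over paired audio–text embeddings and (2) a bidirectional cross-attention block in which each modality attends to the other. The use of PyTorch, the module and variable names, and all dimensions are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): CLIP-style audio-text
# alignment plus a bidirectional cross-attention block, assuming PyTorch and
# pre-computed audio/text embeddings of dimension `dim`.
import torch
import torch.nn as nn
import torch.nn.functional as F


def clip_style_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim); pair i is the positive for row/column i.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    loss_a2t = F.cross_entropy(logits, targets)    # audio -> text direction
    loss_t2a = F.cross_entropy(logits.T, targets)  # text -> audio direction
    return 0.5 * (loss_a2t + loss_t2a)


class BidirectionalCrossAttention(nn.Module):
    """Hypothetical bidirectional cross-attention: each modality attends to the other."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.a2t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, audio_seq, text_seq):
        # audio_seq: (batch, T_a, dim) frame/word-level audio features
        # text_seq:  (batch, T_t, dim) token-level text features
        a_ctx, _ = self.a2t(query=audio_seq, key=text_seq, value=text_seq)
        t_ctx, _ = self.t2a(query=text_seq, key=audio_seq, value=audio_seq)
        return self.norm_a(audio_seq + a_ctx), self.norm_t(text_seq + t_ctx)


if __name__ == "__main__":
    batch, T_a, T_t, dim = 8, 50, 12, 256
    audio_emb, text_emb = torch.randn(batch, dim), torch.randn(batch, dim)
    print("contrastive loss:", clip_style_loss(audio_emb, text_emb).item())

    block = BidirectionalCrossAttention(dim)
    a_out, t_out = block(torch.randn(batch, T_a, dim), torch.randn(batch, T_t, dim))
    print(a_out.shape, t_out.shape)  # (8, 50, 256) (8, 12, 256)
```

In this sketch the contrastive stage would align the two modalities at the word level, and the cross-attention block would then expose the fused, boundary-aware features to a downstream keyword detection head; how the paper combines these stages in detail is described in the method section.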