Pre-trained Vision-Language Models (VLMs) such as CLIP have shown great potential in the multimodal domain. In this context, constructing prompts from multi-modal contexts and interaction features can activate the model's prior knowledge more accurately and thus produce better outputs. However, in CLIP, the mismatch in the form of textual descriptions between the pre-training and inference phases leads to suboptimal prompt representations, which is detrimental to the model's alignment learning. We therefore propose the Region-Attention Prompt (RAP), which introduces region features to enrich the semantic representation of prompts. RAP is obtained through a cross-attention mechanism between images and texts, and is essentially a region-level prompt with category-sensitive properties: for each category, RAP adaptively assigns greater attention weight to the image regions that are more semantically relevant to that category. We further equip CLIP with RAP (termed RA-CLIP) to improve image classification performance in generalization scenarios. Extensive experiments demonstrate that RA-CLIP outperforms the current state-of-the-art method CoCoOp by 0.4%-4.16% on base classes and by 0.25%-11.34% on new classes across 7 datasets. In addition, we show that constructing prompts from category-related regions further improves the model's alignment ability.
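
To illustrate the core idea, the following PyTorch sketch shows how a category-sensitive, region-level prompt could be computed via cross-attention between class text embeddings (queries) and image region features (keys/values). The module name RegionAttentionPrompt, the tensor shapes, and the single-head formulation are illustrative assumptions for exposition, not the paper's actual implementation.

    # Hypothetical sketch of a region-attention prompt via cross-attention.
    # Queries: per-category CLIP text embeddings; keys/values: image region
    # (patch) features. Shapes and names are illustrative assumptions.
    import torch
    import torch.nn as nn

    class RegionAttentionPrompt(nn.Module):
        def __init__(self, dim: int = 512):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)  # projects class text embeddings
            self.k_proj = nn.Linear(dim, dim)  # projects region features
            self.v_proj = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, class_emb: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
            # class_emb: (C, D)   one text embedding per category
            # regions:   (B, R, D) per-image region (patch) features
            # returns:   (B, C, D) category-sensitive, region-weighted prompts
            q = self.q_proj(class_emb)  # (C, D)
            k = self.k_proj(regions)    # (B, R, D)
            v = self.v_proj(regions)    # (B, R, D)
            # Each category attends over image regions, assigning greater
            # weight to regions more semantically relevant to that category.
            attn = torch.einsum('cd,brd->bcr', q, k) * self.scale  # (B, C, R)
            attn = attn.softmax(dim=-1)
            return torch.einsum('bcr,brd->bcd', attn, v)  # (B, C, D)

    # Usage with random stand-in features (B=2 images, R=49 regions, C=10 classes):
    rap = RegionAttentionPrompt(dim=512)
    prompts = rap(torch.randn(10, 512), torch.randn(2, 49, 512))  # (2, 10, 512)

In such a design, the resulting prompt for each category is a weighted mixture of region features, so categories whose evidence is localized (e.g., a small object in a cluttered scene) receive prompts dominated by the relevant regions rather than by the global image representation.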