Compositional Zero-Shot Learning (CZSL) aims to recognize both seen and unseen attribute-object compositions. Recently, some researchers have applied vision-language models to the CZSL task. However, these approaches only coarsely match the image embedding and the composition embedding at the image level, which limits further improvement. Through observation and analysis, we believe that a visual primitive is worth a word. To make full use of visual primitives for fine-grained alignment and to bridge the modality gap, we propose VisPrompt, which lets visual primitives interact with sub-concepts within a prompt. Specifically, VisPrompt aligns the visual primitives (i.e., the visual attribute and the visual object) with the sub-concepts (i.e., the text attribute and the text object) at a fine-grained level. It consists of two steps: (1) we extract the visual attribute embedding with an attribute extraction module and the visual object embedding with an object extraction module; (2) we design an attribute-wise prompt, an object-wise prompt, and a visual reconstructed prompt to be encoded, where each visual primitive plays the role of its corresponding sub-concept during interaction. Our model is therefore capable of performing fine-grained alignment and bridging the gap between vision and text. Extensive experiments on the widely used MIT-States, UT-Zappos, CGQA, and VAW-CZSL datasets show that VisPrompt achieves state-of-the-art results on the core AUC metric. Our code will be made publicly available.
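As a minimal sketch of the two-step idea described above (all module names, shapes, and variable names here are illustrative assumptions, not the paper's actual implementation), the following shows how extracted visual primitives could stand in for their corresponding sub-concept tokens inside the three prompts before text encoding:

```python
import torch
import torch.nn as nn

class VisPromptSketch(nn.Module):
    """Hypothetical sketch of the two-step prompt construction in VisPrompt."""

    def __init__(self, img_dim=768, txt_dim=512, prefix_len=3):
        super().__init__()
        # Step 1: extraction modules mapping the image feature into the text
        # embedding space (assumed here to be simple linear projections).
        self.attr_extractor = nn.Linear(img_dim, txt_dim)  # visual attribute embedding
        self.obj_extractor = nn.Linear(img_dim, txt_dim)   # visual object embedding
        # Learnable prompt prefix, e.g. playing the role of "a photo of".
        self.prefix = nn.Parameter(torch.randn(prefix_len, txt_dim))

    def forward(self, img_feat, attr_tok, obj_tok):
        # img_feat: (B, img_dim) global image feature from a vision encoder
        # attr_tok / obj_tok: (B, txt_dim) word embeddings of the text
        # attribute and text object (the sub-concepts)
        v_attr = self.attr_extractor(img_feat)  # (B, txt_dim)
        v_obj = self.obj_extractor(img_feat)    # (B, txt_dim)
        B = img_feat.size(0)
        prefix = self.prefix.unsqueeze(0).expand(B, -1, -1)
        # Step 2: build the three prompts; a visual primitive replaces its
        # corresponding sub-concept token in each sequence.
        attr_wise = torch.cat([prefix, v_attr.unsqueeze(1), obj_tok.unsqueeze(1)], dim=1)
        obj_wise = torch.cat([prefix, attr_tok.unsqueeze(1), v_obj.unsqueeze(1)], dim=1)
        vis_recon = torch.cat([prefix, v_attr.unsqueeze(1), v_obj.unsqueeze(1)], dim=1)
        # Each (B, prefix_len + 2, txt_dim) sequence would then be passed
        # through a text encoder (e.g., CLIP's) to obtain composition
        # embeddings for fine-grained image-text alignment.
        return attr_wise, obj_wise, vis_recon
```

Under these assumptions, the three prompt sequences would be scored against the image embedding with the usual contrastive objective; the actual extraction modules and prompt templates are specified in the paper itself.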