Visual-language (V-L) models have achieved remarkable success in learning joint visual–textual representations from large web datasets. Prompt learning adapts these models to downstream tasks while mitigating the knowledge forgetting associated with fine-tuning. However, current methods focus on a single modality and fail to fully exploit multimodal information. To address this limitation, this paper proposes visual and text prompt learning (VTPL), an approach that trains the model while enhancing both visual and text prompts. Visual prompts align visual features with text features, whereas text prompts enrich the semantic information of the text. Additionally, this paper introduces a poly-1 information noise contrastive estimation (InfoNCE) loss and a center loss to increase the interclass distance and decrease the intraclass distance. Experiments on 11 image datasets show that VTPL outperforms state-of-the-art methods, improving on CoOp by 1.61%, 1.63%, 1.99%, 2.42%, and 2.87% in the 1-, 2-, 4-, 8-, and 16-shot settings, respectively.
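
As a rough illustration of the two loss terms named above, the following sketch shows one plausible reading of them; it is not the paper's implementation. It assumes that "poly-1" follows the usual PolyLoss form, in which a term epsilon * (1 - p) is added to the cross-entropy-style InfoNCE loss, with p the softmax probability of the matching image-text pair, and that the center loss is the standard half squared distance between a feature and its class center. The function names, the temperature, and the epsilon value are hypothetical choices made for this example.

```python
import math


def poly1_infonce(sim_row, pos_index, temperature=0.07, epsilon=1.0):
    """Poly-1 InfoNCE for one image (assumed form, not taken from the paper).

    sim_row   : similarities between one image and every candidate text
    pos_index : index of the matching (positive) text
    """
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    info_nce = log_sum_exp - logits[pos_index]  # equals -log p_pos
    p_pos = math.exp(-info_nce)
    # Poly-1 correction: extra penalty proportional to (1 - p_pos).
    return info_nce + epsilon * (1.0 - p_pos)


def center_loss(features, labels, centers):
    """Mean over samples of 0.5 * ||feature - center of its class||^2."""
    total = 0.0
    for f, y in zip(features, labels):
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(f, centers[y]))
    return total / len(features)


if __name__ == "__main__":
    # Toy values chosen only for illustration.
    sims = [0.9, 0.1, 0.2]                # the image matches text 0
    feats = [[1.0, 0.0], [0.0, 1.0]]
    centers = [[0.8, 0.1], [0.1, 0.9]]    # one center per class
    total = poly1_infonce(sims, 0) + center_loss(feats, [0, 1], centers)
    print(total)
```

In this reading, the contrastive term pushes each image toward its own class text and away from the others (larger interclass distance), while the center term pulls features of the same class toward a shared center (smaller intraclass distance). How the two terms are weighted against each other is not stated in the abstract, so the unweighted sum above is only a placeholder.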